Cytogenetic Data in AceDB

kirbym at kirbym at
Tue May 10 10:27:33 EST 1994

Does anyone using acedb have cytogenetic data?
If so, do you find it handles this data adequately/some of it/not at all?
Are there any improvements that you would like to see?

Having tried to convert my data into acedb3 format, I find it extremely
frustrating that, yet again, it is impossible to enter cytogenetic
data and get it to display on appropriate maps. It seems to me that very
little thought has actually been given to the representation of this data.
With the next workshop approaching (July 1994), this would be a good time
to get some improvements made that will be useful to everyone and, more
importantly, get them incorporated into the main version of acedb.

The following are based mainly on my own requirements. If these need to be
more generic, or you can see a better way to do the same thing, then please

The acedb philosophy is that "objects" displayed on a map are defined
either as a single Locus or as an Interval which is composed of two end
points or Loci. (Boston workshop) In practice, map locations are defined
in the ?map_location model and are defined numerically.

Mouse cytogenetic data is of two types: genes with locations derived mainly
from hybridisation in situ (HIS) and somatic cell (SC) techniques, and
chromosomal anomalies (or rearrangements) with breakpoint data, eg insertions,
translocations, etc. Breakpoints can be regarded as Loci in acedb. If an
anomaly has two breakpoints on the same chromosome then the intervening
region forms a segment or interval. In all cases the location is given as
chromosome bands.

Improvements to Existing Models and Code Wanted
1.  The ?map_location model should allow
        -  locations to be given as chromosome bands.
        -  end points to be specified to indicate ranges,
           eg human Xq21.1-Xq27.3

    This means the computer works out the numeric coordinates required for
    placing loci and intervals on the map(s) from the Chrom_Band definitions.
    If the band is defined in terms of the tags Contains and Contained_in
    then the algorithm will have to recurse until it can either get a
    numeric definition or fail.

2.  The computer should find the missing bands in that range if the full complement
    of bands is needed elsewhere. (They usually are if the bands in the
    Chromosome column on the map(s) are to be highlighted correctly.)

3.  If a set of ?Chrom_Bands forms an idiogram for a chromosome, is there
    any way the same idiogram can be used on several maps without having
    to redefine the Chrom_Bands for each map?
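
As a rough illustration of points 1 and 2, the following Python sketch
resolves a band range to numeric coordinates by recursing through
Contained_in, and then lists the bands that fall inside the range. The Band
class, the function names, the band names and the coordinates are all
invented for the example; none of this is existing acedb code, and a
sub-band with no coordinates of its own is simply assumed to inherit the
extent of its enclosing band.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Extent = Tuple[float, float]


@dataclass
class Band:
    """Hypothetical stand-in for a ?Chrom_Band object."""
    name: str
    extent: Optional[Extent] = None     # numeric (start, end) on the map, if known
    contained_in: Optional[str] = None  # name of the enclosing band, if any


def resolve(name: str, bands: Dict[str, Band], seen: frozenset = frozenset()) -> Extent:
    """Follow Contained_in until a numeric definition is found, or fail."""
    if name in seen:
        raise LookupError(f"cyclic Contained_in chain at {name}")
    band = bands.get(name)
    if band is None:
        raise LookupError(f"unknown band {name}")
    if band.extent is not None:
        return band.extent
    if band.contained_in is None:
        raise LookupError(f"no numeric definition reachable from {name}")
    return resolve(band.contained_in, bands, seen | {name})


def resolve_range(first: str, last: str, bands: Dict[str, Band]) -> Extent:
    """Numeric extent covering a band range such as Xq21.1-Xq27.3."""
    a, b = resolve(first, bands), resolve(last, bands)
    return (min(a[0], b[0]), max(a[1], b[1]))


def bands_in_range(first: str, last: str, bands: Dict[str, Band]) -> List[str]:
    """Names of all numerically defined bands inside the range, in map order."""
    lo, hi = resolve_range(first, last, bands)
    hits = [b for b in bands.values()
            if b.extent is not None and lo <= b.extent[0] and b.extent[1] <= hi]
    return [b.name for b in sorted(hits, key=lambda b: b.extent)]


# Invented idiogram fragment: coordinates are arbitrary map units.
IDIOGRAM = {b.name: b for b in [
    Band("q21", (10.0, 20.0)),
    Band("q21.1", contained_in="q21"),  # sub-band with no coordinates of its own
    Band("q22", (20.0, 30.0)),
    Band("q23", (30.0, 40.0)),
]}

print(resolve_range("q21.1", "q23", IDIOGRAM))
print(bands_in_range("q21.1", "q23", IDIOGRAM))
```

Because the band table is keyed only by band name, the same table could in
principle be shared by several maps, which is the reuse asked about in
point 3.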

At the very least then, the ?map_location model would be:

    ?map_location UNIQUE Position UNIQUE Float Float
                         Multi_Position  Float Float
                         Ends Left UNIQUE Float Float
                              Right UNIQUE Float Float
                         Chrom_Band ?Chrom_Band ?Chrom_Band
                         Junction ?Chrom_Band ?Chrom_Band

    where  Chrom_Band refers to a location within a band or a range of bands
     and   Junction refers to a point at the junction of two adjacent bands.
           (Yes, we do have this data.)

Except that Left and Right currently define the Interval end points and
should have Chrom_Band definitions as well. (Also, I am for ever being
told by the scientists at Harwell that Left and Right are meaningless and
to use the proper terms instead, ie Proximal and Distal.)

Problems with ?map_location Model
1.  Locations are numeric and have to be specified fairly accurately.
    Cytogenetic data is given as chromosome bands and is often imprecise.
    For example, a deletion is from mouse band 4E2 to the end of the
    chromosome (4E2-4ter).

2.  The data in the ?map_location model are separated from any supporting
    data in the rest of the parent model (eg Locus). This is not a problem
    if there is only one position in the map_location model and the parent
    model refers to only one location. However, nearly all chromosome
    anomalies have two or more breakpoints. Some of these will be listed in
    the same data structure, eg Translocations. Some have BOTH genetic and
    cytogenetic data which has to be linked to the correct location.

3.  Most of the confusion and problems that I have experienced stem
    from the fact that ?map_location is used to define BOTH single points
    on a map and intervals (two points) on a map.

If this proposal is going to cause immense work by changing some of the
fundamental data definitions, at least consider the possibility of creating
new models for cytogenetic data where these changes can be implemented.
For instance, instead of the model Interval there could be a CytoInterval.

The suggestion is that:
    i.  ?map_location be used for SINGLE locations only.
   ii.  All locations are defined by two points.
          - If they are the same, the location is a single point
          - If different, the location is a range
               (Apparently this would preempt the time when molecular
               information will be available and positions will be ranges
               on the physical map.)
  iii.  All Intervals are defined by pairs of double points.
          - If the end point is a range then either the mid point of the
               range would be used to draw the Interval, or, an error bar
               could be drawn to indicate the extent of the uncertainty.
          - If only ONE point is given then the end of the Interval is from
               the single point to the end of the chromosome. A dotted line
               continuing on from the main part of the Interval (solid line)
               could indicate the uncertainty at that end on the map.
   iv.  Intervals (or CytoIntervals) are composed of two Locus definitions
            which in turn access the ?map_location model.
    v.  All anomaly breakpoints be treated as Loci (ie use the model Locus).
   vi.  Can we agree on a common definition for Chromosome Anomalies?
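
Suggestions ii and iii can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: the Location and
DrawnInterval classes and the layout_interval function are invented names,
coordinates are arbitrary map units, and the missing end point is assumed
to be the distal one (as in 4E2-4ter).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Location:
    """A location defined by two points; equal points mean a single point."""
    start: float
    end: float

    @property
    def is_point(self) -> bool:
        return self.start == self.end

    @property
    def midpoint(self) -> float:
        return (self.start + self.end) / 2.0


@dataclass(frozen=True)
class DrawnInterval:
    """What a map display would need in order to draw one Interval."""
    start: float
    end: float
    start_error: Optional[Tuple[float, float]]  # error bar if the end point is a range
    end_error: Optional[Tuple[float, float]]
    open_ended: bool  # True when no distal point was given (could be drawn dotted)


def error_bar(loc: Location) -> Optional[Tuple[float, float]]:
    """Extent of the uncertainty, or None for an exact point."""
    return None if loc.is_point else (loc.start, loc.end)


def layout_interval(proximal: Location,
                    distal: Optional[Location],
                    chrom_end: float) -> DrawnInterval:
    """Use range midpoints as drawing positions; run to chrom_end if open."""
    if distal is None:
        return DrawnInterval(proximal.midpoint, chrom_end,
                             error_bar(proximal), None, True)
    return DrawnInterval(proximal.midpoint, distal.midpoint,
                         error_bar(proximal), error_bar(distal), False)


# A deletion from a band spanning 85-90 to the end of a chromosome of length 100.
print(layout_interval(Location(85.0, 90.0), None, chrom_end=100.0))
```

Finding chrom_end is where treating a set of Chrom_Bands as an idiogram
comes in: the end of the chromosome has to be looked up from somewhere.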

The model definitions would thus become:

    ?map_location  Position Float Float
                   Chrom_Band ?Chrom_Band ?Chrom_Band
                   Junction ?Chrom_Band ?Chrom_Band

    ?Locus Identifier UNIQUE Text
           Location Map ?Map XREF Locus #map_location
           ...      // rest of model definition 

    ?Interval Identifier UNIQUE Text
              Type ?Text             // anomaly type, eg deletion, etc.
              Location Map ?Map Proximal #Locus
                                Distal   #Locus
              Contains Locus ?Locus
                       Fragment ?Interval
              Contained_in ?Chrom_Anomaly XREF Interval

1. The new ?map_location model does not have to affect numeric positions
   for single Loci. The existing arrangement could continue. However, any
   chromosome band information would have to conform to the above schema
   and changes would be required to implement the new Interval or CytoInterval.

2. #Locus is preferable to ?Locus because in the majority of cases the Locus
   model will only be used for the map location data. For anomaly breakpoints
   which may have need of the rest of the Locus model it may still be
   preferable. It is easier to see all the data displayed together under one
   tree (model) rather than have to go to other trees to see different parts
   of the data. Besides I object to having to define x number of loci for the
   same anomaly (given that there could be many breakpoints involved) and
   don't see why they should be lumped in with genes and given identities of
   their own.

   HOWEVER, being in sub objects makes it difficult to get at the location
   data and there is an efficiency tradeoff. What will happen when the linkage
   data is added to each breakpoint?

3. Treating a set of Chrom_Bands as an idiogram will be necessary in order
   to find the end of the chromosome when the location is very uncertain.

4. For human chromosome anomalies which contain large repeats of data
   (such as Simon Mercer's data) Intervals can be defined as belonging to
   or containing other Intervals.

5. Chromosome Anomaly Definitions.
   Any definition for chromosomal anomalies needs to be able to cope with
   both the Long and Short nomenclature forms. The simplest model
   would be:

    ?Chrom_Anomaly Identifier UNIQUE Text
                   Type ?Text             // eg deletion, etc.
                   Chromosome Text
                   Location Cytogenetic_location Text
                            Contains Fragment ?Interval XREF Chrom_Anomaly
                                     Bkpt #Locus
                                     Foreign_gene ?Locus XREF Chrom_Anomaly
                   ...      // rest of model definition 

  -  With this model it is relatively easy for the computer to pick up the
     Intervals and Loci and display them on the map but some clarity
     regarding which breakpoints belong to which Type is lost.

  -  For complex anomalies involving several different types, how useful would
     it be to explicitly identify these? Part of a definition that I have been
     using does this:

     ?Chrom_Anomaly  Identifier        UNIQUE Text
                     Type    Deletion        ?Interval
                             Duplication     ?Interval
                             Insertion Donated_segment ?Interval
                                       Recipient_bkpt  #Locus
                             Inversion       ?Interval
                             Translocation   Bkpt1  #Locus
                                             Bkpt2  #Locus
                             HSR     #Locus  // Homogeneously Staining Region
                             Tertiary_nullisomic  Extra_seg1 ?Interval
                                                  Extra_seg2 ?Interval
                             Tertiary_trisomic    Missing_seg1 ?Interval
                                                  Missing_seg2 ?Interval
                             Transgene       #Locus
                                             Foreign_gene ?Locus
                             Transposition   ?Interval
                             Monosomy        ?Text
                             Trisomy         ?Text
                     ...      // rest of model definition

  - The disadvantage with this definition is that the computer has a lot
    more work to do to find the map locations.

  - Heterochromatin or Homogeneously Staining Regions (HSRs) are
    unidentifiable regions of foreign DNA whose size is often specified
    by + or ++ only. Would it be adequate to define their endpoints by
    those of the regions into which they have been inserted on the host
    chromosome? Please comment.

  - Representing circular chromosomes may be more difficult.

6.  The Location subtree is often used by several models, eg Locus,
    Interval, Chrom_Anomaly and in the Homology model developed by
    Jo Dicks and myself. Can we agree on a common definition for it?
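
On the concern in point 5 that the typed ?Chrom_Anomaly definition gives
the computer more work to do to find the map locations: if an anomaly is
held as a nested tree of tags, the extra work is roughly one recursive walk.
The Python sketch below assumes a plain dict/list representation and
invented object names; it is not how acedb stores or traverses objects.

```python
from typing import Any, Iterator, Tuple

TagPath = Tuple[str, ...]


def iter_locations(node: Any, path: TagPath = ()) -> Iterator[Tuple[TagPath, Any]]:
    """Yield (tag path, value) for every leaf under a nested tag tree."""
    if isinstance(node, dict):
        for tag, child in node.items():
            yield from iter_locations(child, path + (tag,))
    elif isinstance(node, list):
        for child in node:
            yield from iter_locations(child, path)
    else:
        yield path, node


# Invented anomaly with a translocation and a deletion; leaf values are
# placeholders for #Locus sub-objects and ?Interval references.
anomaly = {
    "Identifier": "example_anomaly",
    "Type": {
        "Translocation": {"Bkpt1": "4E2", "Bkpt2": "7F1"},
        "Deletion": ["example_interval"],
    },
}

for tag_path, value in iter_locations(anomaly["Type"]):
    print("/".join(tag_path), value)
```

The tag path that comes back with each value keeps the link between a
breakpoint and its anomaly Type, which is the clarity that the simpler
model loses.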

I apologise for sending such a long letter to the bulletin board but it is
difficult to convey the complexities of dealing with chromosomal anomalies
in acedb. In defence against the charge that all of this should have been
discussed last November when Jean first put forward his proposals: well,
I did send a detailed reply to the bulletin board (Nov 24 1993) but no one
took any notice. If any of the above is useful to other people please say so.

I have code that implements a small part of the above proposal but it needs
a significant amount of work to extend it to do all of the above. A stable
model definition would help. I would be happy to work on this but suspect
that it will also involve substantial changes to parts of the acedb code.
Merging new code into the main acedb code each time there is a new version
released is a real pain and I don't want to waste any more of my time on it.
It would be better if the authors of acedb were involved.
Many thanks to Simon Mercer for the recent helpful discussion.

Michelle Kirby     (email kirbym at
