Acedb February Newsletter

Ed Griffiths edgrif at sanger.ac.uk
Thu Mar 8 11:31:36 EST 2001


The February Newsletter is slightly late again; I'll get back on track soon.

Do remember that contributions to the Newsletter would be most welcome. Topics
could be anything from "your favourite tip" to "your favourite bug" to "my wish
list". Just send me a few paragraphs and I'll put it in the next newsletter.

cheers Ed

 ------------------------------------------------------------------------
| Ed Griffiths, Acedb development, Informatics Group,                    |
|               The Sanger Centre, Wellcome Trust Genome Campus,         |
|               Hinxton, Cambridge CB10 1SA, UK                          |
|                                                                        |
| email: edgrif at sanger.ac.uk  Tel: +44-1223-494780  Fax: +44-1223-494919 |
 ------------------------------------------------------------------------

----------------------------------------------------------------------------
----------------------------------------------------------------------------

ACEDB User Group Newsletter - February 2001
###########################################

If you want to have this newsletter mailed to you _or_ you want to make
comments/suggestions about the format/content then send an email to
acedb at sanger.ac.uk.

This month sees the introduction of code to support representation of mRNA
exons and associated CDS in a single object rather than the two currently
used in much of the human database. Other new features include interactive
control of reporting of DNA mismatches while displaying large links within
fmap. There are also a number of important bug fixes.

New Features
************

Merging of "Supported mRNA" and "Supported CDS" objects for fmap
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

This applies particularly to the acedb human chromosome databases.

Previously, where a set of mRNA exons had a known CDS within them, this was
represented by two Sequence objects in the database. One object held the
positions of the mRNA exons, while the other held a subset of those exons
which represented the CDS. This is hard to maintain because both objects
must be positioned correctly in their parent sequence object _AND_ their
exon coordinates must be kept in step with each other. It is also very
wasteful of space in the database since two objects with two sets of largely
overlapping data must be held in the database.

New code has been added to acedb to enable a single sequence object to
represent a set of mRNA exons and the CDS within those exons. The following
gives a brief example of how this is done.

Parent sequence:
Sequence : "bA404F10"
DNA      "bA404F10" 195976
......etc
Subsequence      "bA404F10.4"      126576 140451
Subsequence      "bA404F10.4.mRNA" 126535 142095
......etc

CDS object:
Sequence : "bA404F10.4"
Source   "bA404F10"
Source_Exons         1    55
Source_Exons      8893  9213
Source_Exons     10520 10792
Source_Exons     11003 11122
Source_Exons     12044 12147
Source_Exons     12885 12947
Source_Exons     13805 13876
CDS
......etc

mRNA object:
Sequence : "bA404F10.4.mRNA"
Source   "bA404F10"
Source_Exons         1    96
Source_Exons      8934  9254
Source_Exons     10561 10833
Source_Exons     11044 11163
Source_Exons     12085 12188
Source_Exons     12926 12988
Source_Exons     13846 15561
......etc

Note that the CDS object has the CDS tag set and that its exons are a strict
subset of the mRNA exons. The CDS tag can be followed by start/end
coordinates for the CDS but this is redundant here because the start and end
of the exons in the CDS object themselves show the start/end of the CDS.

New parent:
Sequence : "bA404F10"
DNA      "bA404F10" 195976
......etc
Subsequence      "bA404F10.4" 126535 142095
......etc

and single CDS/mRNA object:
Sequence : "bA404F10.4"
Source   "bA404F10"
Source_Exons     1 96
Source_Exons     8934 9254
Source_Exons     10561 10833
Source_Exons     11044 11163
Source_Exons     12085 12188
Source_Exons     12926 12988
Source_Exons     13846 15561
CDS     42 1049
......etc

Here the two objects have been compressed into one. The exons are the full
mRNA set of exons and the CDS tag is used to show where the CDS starts and
ends within the exons. Note that the CDS start/end positions are given in
the coordinates of the exons when spliced together and _not_ the
Source_Exons coordinates. Hence (if you do the maths...ugh) the start
position of "42" shows that the CDS starts a little under half way through
the first exon and the end position of "1049" shows that the CDS ends 72
bases into the last exon (the first six exons total 977 bases).
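
The arithmetic can be checked with a short script. The following is only a
sketch in Python, not acedb code: the function name spliced_to_source is
hypothetical, and it assumes 1-based inclusive coordinates on the forward
strand, as in the example object above.

```python
# Exons of the merged "bA404F10.4" object, in Source_Exons coordinates.
EXONS = [(1, 96), (8934, 9254), (10561, 10833), (11044, 11163),
         (12085, 12188), (12926, 12988), (13846, 15561)]

def spliced_to_source(exons, pos):
    """Convert a 1-based position in the spliced exons to Source_Exons coords."""
    remaining = pos
    for start, end in exons:
        length = end - start + 1
        if remaining <= length:
            # The position falls inside this exon.
            return start + remaining - 1
        remaining -= length  # skip this exon and carry on into the next one
    raise ValueError("position lies beyond the end of the spliced exons")

cds_start = spliced_to_source(EXONS, 42)    # lands in the first exon
cds_end = spliced_to_source(EXONS, 1049)    # lands in the last exon
print(cds_start, cds_end)
```

The two results agree with the old two-object form: the old CDS object began
41 bases after the mRNA object (126576 - 126535), so its last exon end of
13876 corresponds to 13917 here.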

How is the new object displayed ? The CDS exon sections of the new object
can be given a different colour using the new "CDS_colour" tag in the Method
object. In the example below the non-CDS section of the exons will be
coloured blue while the CDS section will be red (the default colour is
magenta).

Sequence : "bA404F10.4"
Source   "bA404F10"
Source_Exons     1 96
Source_Exons     8934 9254
Source_Exons     10561 10833
Source_Exons     11044 11163
Source_Exons     12085 12188
Source_Exons     12926 12988
Source_Exons     13846 15561
CDS     42 1049
......etc
Method   "my_CDS"

Method : "my_CDS"
Colour   BLUE
CDS_colour  RED
......etc
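
To see which stretches of the exons end up in which colour, the split can be
worked out as below. This is an illustrative Python sketch only and says
nothing about how fmap itself draws the boxes; the function name
colour_segments is hypothetical and forward-strand, 1-based inclusive
coordinates are assumed.

```python
EXONS = [(1, 96), (8934, 9254), (10561, 10833), (11044, 11163),
         (12085, 12188), (12926, 12988), (13846, 15561)]

def colour_segments(exons, cds_start, cds_end, colour="BLUE", cds_colour="RED"):
    """Split exons into (start, end, colour) pieces in Source_Exons coords.

    cds_start/cds_end are positions in the spliced exons, as given after
    the CDS tag.
    """
    segments = []
    spliced = 0  # number of spliced bases used up by earlier exons
    for start, end in exons:
        # Clip the CDS range to this exon, in Source_Exons coordinates.
        lo = max(start, start + cds_start - spliced - 1)
        hi = min(end, start + cds_end - spliced - 1)
        if lo > hi:
            segments.append((start, end, colour))      # no CDS in this exon
        else:
            if lo > start:
                segments.append((start, lo - 1, colour))
            segments.append((lo, hi, cds_colour))
            if hi < end:
                segments.append((hi + 1, end, colour))
        spliced += end - start + 1
    return segments

for segment in colour_segments(EXONS, 42, 1049):
    print(segment)
```

Only the first and last exons are split in this example; the five exons in
between lie wholly inside the CDS.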

What if the CDS extends beyond the set of exons in the object ?
There are two tags in the existing models commonly used in the Sanger Centre
that can be used to deal with this situation:

// #Sequence#   (From models.wrm for 22ace)

?Sequence DNA UNIQUE ?DNA UNIQUE Int                    // Int is the length
......etc
          Properties    Pseudogene Text
......etc
                        End_not_found
                        Start_not_found Int

These tags have the following meaning:

Start_not_found Int
     Setting this tag means that the start of the CDS lies somewhere
     upstream of the exons in this object. The Int should be given one of
     the values 1, 2 or 3 to establish the reading frame for protein
     translation of the CDS (defaults to 1 if no value is given). Note also
     that to be pedantic the model should say "Start_not_found UNIQUE Int".
End_not_found
     Setting this tag means that the end of the CDS lies somewhere
     downstream of the exons in the object. This tag is not followed by an
     Int because it is assumed that translation will always proceed to the
     end of the last exon and the translation code can itself detect when
     to stop because it has run out of codons.
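
The effect of the two tags on translation might be sketched as follows. This
is a minimal Python illustration and not acedb's translation code: the
function name translate_partial_cds is hypothetical, the standard genetic
code is assumed, and it is assumed that a Start_not_found value of N means
that translation starts at base N of the spliced exons.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate_partial_cds(spliced_dna, start_not_found=1):
    """Translate spliced exon DNA whose CDS start (and maybe end) is missing.

    start_not_found is the reading frame (1, 2 or 3). Translation simply
    runs until there are no complete codons left, which is all that is
    needed when End_not_found is set: no stop codon is required.
    """
    if start_not_found not in (1, 2, 3):
        raise ValueError("Start_not_found must be 1, 2 or 3")
    dna = spliced_dna.upper()
    protein = []
    for i in range(start_not_found - 1, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))  # X = unknown codon
    return "".join(protein)

print(translate_partial_cds("CATGGCC", start_not_found=2))
```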

How do I go from two objects down to one ? Well the first point to note is
that the new code will run perfectly well with databases that contain the
"two object" representation of the mRNA/CDS exons, so objects can be merged
gradually as required. It is not possible (and almost certainly not
desirable) for the code to do this automatically: the existing two objects
are linked only by "similar" names and a common sequence parent often shared
with many other objects all containing exons. Conversion will require the
use of a specially written script to extract the two objects from the
database, merge them and parse them back into the database.
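
The core calculation such a script needs is the CDS start/end in spliced
mRNA coordinates. A rough Python sketch is given below using the bA404F10.4
data from earlier. It is not a supplied acedb tool: the function names are
hypothetical, both objects are assumed to lie on the forward strand of the
same parent, and extracting the objects and parsing the result back in are
left out.

```python
MRNA_PARENT_START = 126535   # Subsequence "bA404F10.4.mRNA" start in parent
CDS_PARENT_START = 126576    # Subsequence "bA404F10.4" start in parent
MRNA_EXONS = [(1, 96), (8934, 9254), (10561, 10833), (11044, 11163),
              (12085, 12188), (12926, 12988), (13846, 15561)]
CDS_EXONS = [(1, 55), (8893, 9213), (10520, 10792), (11003, 11122),
             (12044, 12147), (12885, 12947), (13805, 13876)]

def source_to_spliced(exons, pos):
    """Convert a Source_Exons position to a position in the spliced exons."""
    spliced = 0
    for start, end in exons:
        if start <= pos <= end:
            return spliced + pos - start + 1
        spliced += end - start + 1
    raise ValueError("position %d is not inside any exon" % pos)

def merged_cds_range(mrna_start, mrna_exons, cds_start, cds_exons):
    """Return (start, end) to put after the CDS tag of the merged object."""
    # Shift CDS-object coordinates into mRNA-object coordinates.
    offset = cds_start - mrna_start
    first = cds_exons[0][0] + offset
    last = cds_exons[-1][1] + offset
    return (source_to_spliced(mrna_exons, first),
            source_to_spliced(mrna_exons, last))

start, end = merged_cds_range(MRNA_PARENT_START, MRNA_EXONS,
                              CDS_PARENT_START, CDS_EXONS)
print("CDS     %d %d" % (start, end))
```

A real script would also want to check that every CDS exon really is
contained in an mRNA exon before merging.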

How can I control which sections of the single object are operated on by
the various protein translation options in fmap ? The fmap menu for exons
now includes options to either translate the CDS section or the entire set
of exons and display or export the result.

Improved format for server log
++++++++++++++++++++++++++++++

The acedb socket server log has changed name from database/server.log to
database/serverlog.wrm to be consistent with other acedb log/configuration
files.

The records in the server log are now output in the same format as the
log.wrm records which brings the following improvements:

   * All records are now stamped with the time, machine and process id for
     the program.
   * All programs record SESSION_START and SESSION_END records which show
     information such as which user was running the program, how the program
     ended (normal or crash) and so on (see also January's newsletter).

Interactive control of DNA mismatch reporting in xace
+++++++++++++++++++++++++++++++++++++++++++++++++++++

Originally xace would report every single mismatch between every pair of DNA
objects it attempted to align. This was so irritating that the code was
changed to report errors only once per pair of objects aligned. Sadly this
is still exceptionally irritating for those who are trying to construct
large links from existing sequences because the number of pairs of DNA
objects to be aligned can be very large. This is exacerbated by the fact
that the user, when first making a link, already knows the DNA is
incorrectly aligned.

You can now interactively turn on/off reporting of DNA mismatches by
selecting the "Report/Don't Report DNA Mismatches" item from the main menu
in the fmap. Reporting will stay disabled for each subsequent reuse of that
fmap.

Articles
********

Bugs Fixed
**********

Tree Display menu
+++++++++++++++++

Several problems have been fixed in the Tree Display menu. Some old options
that were removed have been put back because users preferred them to the new
ones, e.g. "Preserve". A bug where the "Show As Text" option disappeared
from the menu has been fixed. The menu should now work as it always has but
with some extra options. If you still have problems with the menu while
using the latest monthly build then please mail acedb at sanger.ac.uk.

Catching SIGABRT
++++++++++++++++

The operating system sometimes needs to interrupt the execution of a program,
perhaps because of a serious error such as the program trying to access
another program's memory space. It does this by directly interrupting the
program's execution with a "signal". The signal could be one of a number of
types such as "SIGSEGV", which means the program was trying to access another
program's memory, or "SIGFPE", which means the program was trying to do an
illegal floating point operation such as dividing by zero. The program is
allowed to "catch" these signals and try to decide what to do about them.
AceDB catches signals so that it can clear up its read/write locks before
exiting.

One of these signals is reserved specifically for interrupting a program and
producing a snapshot of what the program was doing when interrupted; this is
the familiar "core" file. The signal for doing this is called SIGABRT. The
acedb code was erroneously catching this signal meaning that the core file
was not produced correctly, or in some cases not produced at all.

This bug has been fixed and the following now applies for signal handling:

SIGABRT
     signal is _never_ caught by AceDB.
all other catchable signals
     signals are caught and acedb clears read/write locks and gives a chance
     to save work before exiting.

Sometimes with serious, reproducible bugs it would be useful for AceDB to
not catch any signals so that a core file would be produced when the error
occurs. Signal handling can now be turned off in one of two ways:

"-nosigcatch"
     Use this command line option when you run an acedb program to turn off
     signal catching from the start:

                tace -nosigcatch /your/database


"Admin" menu item in xace
     There is a new "Signal Catching Off/On" option in the "Admin" menu
     which can be used to toggle signal catching on and off.

By default programs will run as they always have with signal catching turned
on. This is how you should normally run the code. If you turn signal
catching off and have been writing to the database when acedb crashes, the
database will not be cleaned up, with the result that it may get corrupted.
This facility is intended for use in debugging difficult errors, not as a
standard way to run acedb.
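
For readers unfamiliar with signal handling, the policy described in this
section can be illustrated with a few lines of Python. This is a sketch of
the idea only and not the acedb implementation (which is in C): the names
CAUGHT_SIGNALS, release_locks_and_exit and install_handlers are
hypothetical, and only two signals are handled to keep the example short.

```python
import signal

# Illustrative subset of the "all other catchable signals".
CAUGHT_SIGNALS = [signal.SIGINT, signal.SIGTERM]

def release_locks_and_exit(signum, frame):
    """Stand-in for clearing read/write locks before exiting."""
    print("caught signal %d, releasing locks" % signum)
    raise SystemExit(1)

def install_handlers(catch_signals=True):
    """catch_signals=False plays the role of the -nosigcatch option."""
    for sig in CAUGHT_SIGNALS:
        handler = release_locks_and_exit if catch_signals else signal.SIG_DFL
        signal.signal(sig, handler)
    # SIGABRT always keeps its default action so that a core file can be
    # produced.
    signal.signal(signal.SIGABRT, signal.SIG_DFL)
```

Calling install_handlers() gives the normal behaviour, while
install_handlers(catch_signals=False) leaves every signal with its default
action.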

Print bug
+++++++++

An annoying bug whereby xace would sometimes "freeze" when an attempt was
made to print has now been fixed (DDTS bugs: SANgc10014 & SANgc10359).

Tablemaker bug
++++++++++++++

A bug in the meaning of "hidden column" in tablemaker meant it mapped onto
the "hidden" state in the table display system, which caused rows which
differed only in hidden columns to appear multiple times. The semantics of
"hidden" in tablemaker are not "don't show me this column", but "this is an
intermediate working column which doesn't appear in the result table". The
code was changed to reflect this: columns marked as hidden are not included
at all in the output table.
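
The corrected meaning of "hidden" can be pictured with a small Python
sketch. This is not the tablemaker code: the function name result_table is
hypothetical, the rows are made-up data, and it assumes that the result
table holds each distinct visible row only once.

```python
def result_table(rows, hidden):
    """Drop hidden columns, then keep each distinct visible row once.

    rows   -- list of tuples, one per row of the working table
    hidden -- set of 0-based indexes of columns marked as hidden
    """
    seen = set()
    output = []
    for row in rows:
        visible = tuple(v for i, v in enumerate(row) if i not in hidden)
        if visible not in seen:   # rows differing only in hidden columns merge
            seen.add(visible)
            output.append(visible)
    return output

# Made-up working table: the middle column is an intermediate working column.
working = [("geneA", "exon1", "chr1"),
           ("geneA", "exon2", "chr1"),
           ("geneB", "exon1", "chr2")]
print(result_table(working, hidden={1}))
```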

Two URL bugs
++++++++++++

Two fixes for URL handling by acedb:

   * the "(" and ")" characters in URLs need to be escaped (i.e. not
     parsed), since the Netscape remote mechanism mangles them.
   * If a Rewrite tag was followed by only one string, acedb crashed.

Dumping bug
+++++++++++

There was a bug in the dumping code for perl-style and other formats which
caused acedb to crash or give strange output if the text being dumped
contained a "%". This is now fixed.

Future Plans
************

If you wish to make suggestions about any of these plans, please mail them
to acedb at sanger.ac.uk.

AceDB and gapped alignments
+++++++++++++++++++++++++++

Coming in 4_9...

As of AceDB 4_9 (due out anytime now...honest...), AceDB will support the
viewing of gapped alignments in Blixem. This is a combination of work on
Blixem and the new "Smap" code that will support a much more sophisticated
way of constructing "virtual" sequences from clone, gene etc. data than the
current fmap supports.

AceDB and XML Schema
++++++++++++++++++++

Coming soon...

Work is currently under way to output Ace data in XML format. As well as the
data, AceDB will output XML Schema that describe the data and will enable
the data to be verified using existing XML parsers that support XML Schema.

February monthly build now available.
*************************************

You can pick up the monthly builds from:

Sanger users
     ~acedb/RELEASE.DEVELOPMENT
External users
     http://www.acedb.org/Software/Downloads/monthly.shtml

Next User Group Meeting - D319, 2.30pm, Thursday, 15th March 2001
*****************************************************************

_!*! Please note changed venue !*!_
----------------------------------------------------------------------------
----------------------------------------------------------------------------
Ed Griffiths <edgrif at sanger.ac.uk>
Last modified: Thu Mar 8 11:51:30 GMT 2001







