how do I automatically update gene coordinates after re-sequencing the genome
m.claesson at student.ucc.ie
Tue Jul 15 08:57:10 EST 2003
Thanks for your answer Tim,
Frustrating that so many go through the same ordeals over and over
again, isn't it? However, after I sent out my question I found a public
program that seems to solve the problem, at least for prokaryotes. It's
Sequin at http://www.ncbi.nlm.nih.gov/Sequin/
It has quite a nice feature that checks the differences between the old
and new sequences, lets you scroll through them and then updates the
annotations. The input is a sequence in FASTA format and a tab-separated
text file with annotations. Sequin then creates a GenBank entry that can
be updated with a new sequence in FASTA format.
To me this looks very good! Have a look at it!
timc at chiark.greenend.org.uk (Tim Cutts) wrote in message news:<S+x*-W7Wp at news.chiark.greenend.org.uk>...
> In article <e818c15b.0307100440.58d9f00b at posting.google.com>,
> Marcus Claesson <m.claesson at student.ucc.ie> wrote:
> >One could do this by writing a program that runs blastn with all the
> >genes against the new sequence and then picks out the new coordinates
> >for the nearly identical hits. Gene duplicates etc. could make it a bit
> >messy.
> For bacterial genomes, BLAST is probably fast enough. For mammalian
> genomes it isn't (unless you have many hundreds of CPUs available, which
> only a few sites do).
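For a bacterial genome, the BLAST idea could be sketched roughly as below. This is only a minimal sketch, assuming the hits are already available in the standard 12-column tabular BLAST output (`-outfmt 6` in BLAST+, `-m 8` in the older blastall). The function name `remap_from_blast`, the `gene_lengths` dictionary and both thresholds are made up for illustration. Genes with no good hit, or with more than one (duplicates), are set aside rather than guessed at.

```python
from collections import defaultdict

def remap_from_blast(tabular_lines, gene_lengths,
                     min_identity=99.0, min_coverage=0.99):
    """Pick new coordinates from tabular BLAST hits.

    tabular_lines: lines with the 12 standard columns
        (qseqid sseqid pident length mismatch gapopen
         qstart qend sstart send evalue bitscore).
    gene_lengths: hypothetical dict mapping gene name -> gene length.
    Returns (remapped, unresolved); coordinates are 1-based, as in BLAST.
    """
    candidates = defaultdict(list)
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        gene = fields[0]
        pident, aln_len = float(fields[2]), int(fields[3])
        sstart, send = int(fields[8]), int(fields[9])
        # Keep only nearly identical, nearly full-length hits.
        if pident < min_identity:
            continue
        if aln_len < min_coverage * gene_lengths[gene]:
            continue
        # BLAST reports sstart > send for hits on the minus strand.
        strand = "+" if sstart <= send else "-"
        candidates[gene].append((min(sstart, send), max(sstart, send), strand))

    remapped, unresolved = {}, {}
    for gene in gene_lengths:
        hits = candidates.get(gene, [])
        if len(hits) == 1:
            remapped[gene] = hits[0]
        else:
            # Zero hits (gene lost or changed) or several (duplicates).
            unresolved[gene] = hits
    return remapped, unresolved
```

Anything that ends up in `unresolved` would still need a look by hand, which is where the gene duplicates show up.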
> >Has anyone out there done this before, and do you have any tips?
> >Would be extremely grateful if you could share them!
> Most genome annotation projects have this problem, as you suggest.
> I used to work at Incyte Genomics, and while there I employed someone
> specifically to write code to solve this very problem. Unfortunately
> the code was not made publicly available.
> I can't speak for the particular problems of bacterial genomes, but in
> the human genome we were hit by the usual issues; many features are very
> difficult to remap automatically. For example, I remember trying to
> remap an STS tiling path across the coding regions of one particular
> gene from the original gene build (which was on HTG draft sequence) onto
> the final sequence that came along later.
> The problem was that this particular gene had about 12 alternative 5'
> exons, which were on average about 98% sequence identical with each
> other. Made remapping very difficult (as well as designing unique STSs
> for that gene, of course!)
> The second problem was speed. BLAST and other DP algorithms just
> weren't fast enough. We did come up with an exact string matching
> method that was much faster, but were usually left with about 20% of
> features which the algorithm would flag up as needing human
> intervention; typically this occurred when the new version of the
> sequence contained indels relative to the original build sequence.
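To make the exact-matching idea concrete, here is a rough sketch of how it might look. It is not the Incyte code, which was never released, only my guess at the general shape. It assumes 0-based, half-open `(start, end)` coordinates and forward-strand features only; `remap_exact` and `find_all` are names invented for the example. A feature is remapped only if its old sequence occurs exactly once in the new sequence; everything else is flagged for a human.

```python
def find_all(haystack, needle):
    """Return the start positions of every (possibly overlapping) match."""
    positions = []
    i = haystack.find(needle)
    while i != -1:
        positions.append(i)
        i = haystack.find(needle, i + 1)
    return positions

def remap_exact(old_seq, new_seq, features):
    """Remap features by exact string matching.

    features: dict of name -> (start, end), 0-based half-open, on old_seq.
    Returns (remapped, flagged): remapped maps name -> new (start, end);
    flagged maps name -> number of exact matches found (0 or more than 1).
    """
    remapped, flagged = {}, {}
    for name, (start, end) in features.items():
        feature_seq = old_seq[start:end]
        hits = find_all(new_seq, feature_seq)
        if len(hits) == 1:
            remapped[name] = (hits[0], hits[0] + len(feature_seq))
        else:
            # No match (e.g. an indel inside the feature) or an ambiguous one.
            flagged[name] = len(hits)
    return remapped, flagged
```

A feature whose sequence picked up an indel in the new build gets zero matches and lands in `flagged`, which fits the kind of case described above as needing human intervention.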
> The smaller the feature, the harder it is to remap, because its
> sequence is more likely to occur elsewhere by chance. SNPs were the
> trickiest, since you then have to decide how much flanking sequence to
> use to help the mapping process. The more you use, the more accurate it
> gets, but the slower it is to run.
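The flanking-sequence trade-off for SNPs can be illustrated with a small sketch. It is just one possible way to do it: try a short flank first and widen it only while the match is ambiguous. The function name `remap_snp` and the default flank sizes are assumptions of mine, positions are 0-based, and `flanks` is assumed to be in increasing order.

```python
def remap_snp(old_seq, new_seq, pos, flanks=(10, 20, 40, 80)):
    """Remap a single-base position using growing flanking windows.

    Returns (new_pos, flank_used). new_pos is None if no window matched,
    or if every window tried matched more than once.
    """
    for flank in flanks:
        left = max(0, pos - flank)
        right = min(len(old_seq), pos + flank + 1)
        window = old_seq[left:right]
        first = new_seq.find(window)
        if first == -1:
            # Wider windows contain this one, so they cannot match either.
            return None, flank
        if new_seq.find(window, first + 1) == -1:
            # Unique match: the SNP sits (pos - left) bases into the window.
            return first + (pos - left), flank
        # Ambiguous: try again with more flanking sequence.
    return None, flanks[-1]
```

Note that a unique match on a short window is not proof of a correct one: if the true location has changed and a near-copy exists elsewhere, the short window can match the copy alone. Starting with a longer flank lowers that risk at the cost of more failed matches and more run time.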
> Many annotation projects currently seem to prefer re-running their
> automated annotation pipelines to trying to remap their existing
> annotation.
> You may consider this to be burying one's head in the sand, and I
> couldn't possibly comment. :-)
More information about the Bio-soft