First Zebrafish Assembly Released
Kerstin Jekosch
kj2 at sanger.ac.uk
Tue Jul 23 11:44:45 EST 2002
First assembly of the zebrafish genome released
===============================================
Please note that this is a *preliminary* assembly and there are a
number of points to remember:
There is a high level of misassembly. This is because the source
DNA came from ~1000 five-day-old embryos and the polymorphism rate
is at least 1 per 200 bp, with additional significant indels. Thus regions
of the genome which are highly variable do not form clusters for
assembly since the sequences that originate from a given region
are quite likely from different haplotypes. This causes assembly
dropouts for some regions and false duplications in other regions
where phrap splits different haplotypes into multiple paths. We are
working on the assembly code, Phusion, to address these issues.
However, there is an enormous amount of useful sequence in this
assembly, and we hope this outweighs the problems in the assembly.
More information is available at:
=================================
ftp://ftp.ensembl.org/pub/traces/zebrafish/assembly/assembly06/README
Although the assembly is being made available as early as
possible to the research community, an Ensembl gene build has
NOT yet been performed. We are investigating this now but for the
moment Ensembl will continue to present clone-based data.
In a few weeks we plan to release an updated Ensembl which presents
all normal Ensembl features except Ensembl gene predictions.
The assembly may be searched using BLAST at:
http://www.ensembl.org/Danio_rerio/blastview
and by SSAHA at:
http://www.ensembl.org/Danio_rerio/ssahaview
Note that Zebrafish SSAHA now supports very rapid queries using
protein sequences. This feature will be extended to all Ensembl
species in due course.
Assembly data are available at:
ftp://ftp.ensembl.org/pub/traces/zebrafish/assembly/assembly06/
Assembly Statistics
===================
We started with 9643640 reads comprising 6.07Gbp (630bps
average RL). There are 7942778 unique reads, 82.4% of starting
reads, in the assembly.
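The read figures above can be cross-checked with simple arithmetic. The
following is a minimal sketch that uses only the numbers quoted in this
message; the variable names are illustrative and not part of any
assembly tool.

```python
# Figures quoted in this announcement.
total_reads = 9_643_640
total_bases = 6.07e9          # 6.07 Gbp, as rounded in the text
reads_in_assembly = 7_942_778

# Average read length in bp; should come out close to the quoted 630 bps.
average_read_length = total_bases / total_reads

# Fraction of the starting reads that ended up in the assembly.
fraction_assembled = reads_in_assembly / total_reads

print(f"average read length: {average_read_length:.0f} bp")
print(f"reads in assembly:   {fraction_assembled:.1%}")
```

Because the total base count is itself rounded to 6.07 Gbp, the computed
average read length can differ from the quoted value by about a base.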
Phusion was used to cluster the reads and phrap was used for
cluster assembly and consensus generation.
Small supercontigs with fewer than 3 reads or smaller than 1kb were
rejected. 3.5Mbp of the assembly was rejected as possible
contamination based on read source statistics at the supercontig
level.
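The size filter described above can be sketched as follows. This is an
illustration only, not the code used for the assembly: the Supercontig
record and its field names are hypothetical, 1kb is assumed to mean
1000 bp, and the contamination screen based on read source statistics
is not modelled.

```python
from typing import NamedTuple


class Supercontig(NamedTuple):
    """Hypothetical record; field names are assumptions for illustration."""
    name: str
    n_reads: int
    length_bp: int


MIN_READS = 3         # supercontigs with fewer reads are rejected
MIN_LENGTH_BP = 1000  # assumes "1kb" means 1000 bp


def keep(sc: Supercontig) -> bool:
    """Return True if the supercontig passes both size thresholds."""
    return sc.n_reads >= MIN_READS and sc.length_bp >= MIN_LENGTH_BP


# Made-up example data, not taken from the assembly.
candidates = [
    Supercontig("sc_a", 12, 7400),
    Supercontig("sc_b", 2, 5000),   # rejected: fewer than 3 reads
    Supercontig("sc_c", 5, 800),    # rejected: smaller than 1kb
]
kept = [sc.name for sc in candidates if keep(sc)]
print(kept)
```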
For the supercontigs (bp measures include estimated gap sizes):
Total bases = 1169967887 bps
Supercontigs = 158689
Average length = 7372 bps
Largest = 168788 bps
N50 = 20521 bps (n = 16515 contigs)
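For readers unfamiliar with the N50 statistic, here is a minimal sketch
of the usual definition: sort the sequences from longest to shortest and
report the length at which the running total first reaches half of all
bases. This message does not say exactly how the figure was computed, so
the function below, and the reading of "n" as the number of sequences
needed to reach that point, are assumptions.

```python
def n50(lengths):
    """Return (N50 length, number of sequences needed to reach half the total)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no lengths given")


# Made-up lengths: total is 30, so half is 15; 10 + 8 = 18 reaches it.
print(n50([10, 8, 6, 4, 2]))  # (8, 2)
```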
Estimated coverage based on 12Mbp of 143 finished clones gives:
Supercontig coverage: 77%
Contig coverage: 61%
++++++++++++++++++++++++++++++++++++++++++++
Dr. Kerstin Jekosch phone +441223494971
Bioinformatics fax +441223494919
Wellcome Trust Sanger Institute
Hinxton, Cambridge CB10 1SA, UK
kj2 at sanger.ac.uk
++++++++++++++++++++++++++++++++++++++++++++