First Zebrafish Assembly Released

Kerstin Jekosch kj2 at
Tue Jul 23 11:44:45 EST 2002

First assembly of the zebrafish genome released

Please note that this is a *preliminary* assembly and there are a
number of points to remember:

There is a high level of misassembly.  This is because the source 
DNA came from ~1000 five-day-old embryos, and the polymorphism rate 
is at least 1 per 200 bp, with additional significant indels.  Regions 
of the genome which are highly variable therefore do not form clusters 
for assembly, since the sequences that originate from a given region 
are quite likely to come from different haplotypes.  This causes 
assembly dropouts in some regions and false duplications in others, 
where phrap splits different haplotypes into multiple paths.  We are 
working on the assembly code, Phusion, to address these issues.  
However, there is an enormous amount of useful sequence in this 
assembly, and we hope this outweighs the problems.  

More information is available at:

Although the assembly is being made available as early as 
possible to the research community, an Ensembl gene build has 
NOT yet been performed. We are investigating this now but for the 
moment Ensembl will continue to present clone-based data.

In a few weeks we plan to release an updated Ensembl which presents
all normal Ensembl features except Ensembl gene predictions.

The assembly may be searched using BLAST at:

and by SSAHA at:

Note that Zebrafish SSAHA now supports very rapid queries using
protein sequences. This feature will be extended to all Ensembl
species in due course.

Assembly data are available at:

Assembly Statistics

We started with 9643640 reads comprising 6.07Gbp (average read 
length 630 bp). There are 7942778 unique reads, 82.4% of starting 
reads, in the assembly.  
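
As a quick sanity check, the read figures above can be recomputed 
from the quoted totals. This is only a sketch: the function and 
variable names are illustrative and not part of any assembly 
pipeline, and 6.07Gbp is itself a rounded value.

```python
# Figures quoted in the announcement.
TOTAL_READS = 9_643_640
TOTAL_BASES = 6.07e9            # 6.07 Gbp, already rounded in the text
READS_IN_ASSEMBLY = 7_942_778


def average_read_length(total_bases: float, total_reads: int) -> float:
    """Mean read length in bp."""
    return total_bases / total_reads


def fraction_in_assembly(reads_used: int, total_reads: int) -> float:
    """Fraction of the starting reads that ended up in the assembly."""
    return reads_used / total_reads


avg_len = average_read_length(TOTAL_BASES, TOTAL_READS)
used = fraction_in_assembly(READS_IN_ASSEMBLY, TOTAL_READS)

print(f"average read length: {avg_len:.0f} bp")
print(f"reads in assembly:   {used:.1%}")
```

With the rounded 6.07Gbp total this gives roughly 629 bp per read, 
in line with the quoted ~630 bp, and 82.4% of reads in the assembly.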

Phusion was used to cluster the reads and phrap was used for 
cluster assembly and consensus generation.

Small supercontigs with fewer than 3 reads or smaller than 1kb were 
rejected. 3.5Mbp of the assembly was rejected as possible 
contamination based on read source statistics at the supercontig 
level.
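
The size filter described above can be sketched as follows. The 
Supercontig record, its field names and the example entries are 
hypothetical; only the two thresholds (3 reads, 1kb) come from the 
text. The contamination screen is not shown, since its criteria 
are not given here.

```python
from dataclasses import dataclass


@dataclass
class Supercontig:
    name: str        # hypothetical identifier
    n_reads: int     # number of reads placed in the supercontig
    length_bp: int   # supercontig length in bp


MIN_READS = 3          # "fewer than 3 reads" are rejected
MIN_LENGTH_BP = 1000   # "smaller than 1kb" are rejected


def passes_size_filter(sc: Supercontig) -> bool:
    """Keep a supercontig only if it meets both thresholds."""
    return sc.n_reads >= MIN_READS and sc.length_bp >= MIN_LENGTH_BP


# Made-up examples, for illustration only.
candidates = [
    Supercontig("sc_a", n_reads=2, length_bp=1500),   # too few reads
    Supercontig("sc_b", n_reads=5, length_bp=800),    # too short
    Supercontig("sc_c", n_reads=12, length_bp=7400),  # kept
]

kept = [sc.name for sc in candidates if passes_size_filter(sc)]
print(kept)
```

Because a supercontig is rejected if it fails either condition, the 
keep test requires both conditions to hold.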

For the supercontigs (bp measures include estimated gap sizes):

  Total bases    = 1169967887 bps
  Supercontigs   =     158689
  Average length =       7372 bps
  Largest        =     168788 bps

  bases        contigs
  N50 = 20521, n =   16515
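
For readers unfamiliar with the N50 figure, here is a minimal 
sketch assuming the conventional definition: the length L such that 
sequences of length L or longer contain at least half of all bases, 
with n the number of such sequences. The announcement does not say 
exactly how its figure was computed, and the toy lengths below are 
invented. The average length quoted above is also rechecked.

```python
def n50(lengths):
    """Return (N50, n) for a list of sequence lengths.

    N50 is the length of the sequence at which the running total,
    taken from longest to shortest, first reaches half of all bases;
    n is how many sequences were needed to get there.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no sequence lengths given")


# Toy data: total is 30, so half is 15; 10 + 8 = 18 reaches it.
toy_n50, toy_n = n50([2, 4, 6, 8, 10])
print(toy_n50, toy_n)

# Average supercontig length from the totals quoted above.
TOTAL_BASES = 1_169_967_887
SUPERCONTIGS = 158_689
print(TOTAL_BASES // SUPERCONTIGS)
```

On the toy data this yields an N50 of 8 with n = 2, and the integer 
average of the quoted totals is 7372 bps, matching the table.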

Estimated coverage based on 12Mbp of 143 finished clones gives:
  Supercontig coverage: 77%
  Contig coverage: 61%

Dr. Kerstin Jekosch      phone +441223494971           
Bioinformatics           fax   +441223494919          
Wellcome Trust Sanger Institute
Hinxton, Cambridge CB10 1SA, UK
kj2 at	        
