new zebrafish whole genome shotgun

Kerstin Jekosch kj2 at sanger.ac.uk
Thu Apr 3 12:25:48 EST 2003


Please visit our web site at http://www.sanger.ac.uk/Projects/D_rerio/
to find the  

Second assembly Zv2 of the zebrafish genome released

Please note that this is a *preliminary* assembly and there are a
number of points to remember:

There is a high level of misassembly.  This is because the source DNA 
came from ~1000 5-day-old embryos and the polymorphism rate is at least 
1 per 200 bp, with additional significant indels.  Thus highly variable 
regions of the genome do not form clusters for assembly, since the 
sequences that originate from a given region are quite likely to come 
from different haplotypes.  This causes assembly dropouts for some 
regions and false duplications in other regions where phrap splits 
different haplotypes into multiple paths.  We are working on the 
assembly code, Phusion, to address these issues.  However, there is 
an enormous amount of useful sequence in this assembly and we hope 
this outweighs the problems in the assembly.
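
As a rough illustration of why this polymorphism level disrupts clustering, 
the sketch below estimates how many single-base differences to expect 
between two overlapping reads from different haplotypes.  It assumes that 
differences occur independently at 1 per 200 bp (the lower bound quoted 
above) and follow a Poisson model, and it ignores indels and sequencing 
errors.  The function names are hypothetical and this is not how Phusion 
or phrap model the problem.

```python
import math

# Assumed rate: the lower bound of 1 difference per 200 bp quoted in the text.
RATE_PER_BP = 1 / 200


def expected_differences(overlap_bp, rate_per_bp=RATE_PER_BP):
    """Expected number of single-base differences over an overlap."""
    return overlap_bp * rate_per_bp


def prob_identical(overlap_bp, rate_per_bp=RATE_PER_BP):
    """Probability of zero differences under the assumed Poisson model."""
    return math.exp(-expected_differences(overlap_bp, rate_per_bp))


if __name__ == "__main__":
    # 651 bp is the average read length reported in the assembly statistics.
    for overlap in (100, 300, 651):
        print(overlap,
              expected_differences(overlap),
              round(prob_identical(overlap), 3))
```

Under these assumptions, longer overlaps between reads from different 
haplotypes are increasingly unlikely to match exactly.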

We tried to include the fingerprint information from our fpc database to 
merge assembly supercontigs.  Where this could be done, the new contigs 
were named after the fpc contig that led to the merge (e.g. ctg123).  
However, please note that this assembly is not tied to a map, and 
mapping information derived from the contig names is therefore to be 
treated with care.  We will soon offer a search tool to make all mapping 
information for a given supercontig available.

Although the assembly is being made available as early as possible to 
the research community, an Ensembl gene build has NOT yet been 
performed.  An Ensembl pre-release, however, is available.

Assembly Statistics:

We started with 11737560 reads comprising 7.64 Gbp (651 bp 
average read length).  There are 9953938 unique reads, 84.8% of the 
total reads, placed in the assembly.
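
The two derived figures above can be reproduced from the raw counts.  The 
following is only a small arithmetic check; the constant and function 
names are hypothetical, and 7.64 Gbp is taken as 7.64e9 bases.

```python
# Counts quoted in the text above.
TOTAL_READS = 11_737_560
TOTAL_BASES = 7.64e9      # 7.64 Gbp, assumed to mean 7.64e9 bases
PLACED_READS = 9_953_938


def average_read_length(total_bases, total_reads):
    """Mean read length in bp."""
    return total_bases / total_reads


def percent_placed(placed_reads, total_reads):
    """Percentage of all reads that were placed in the assembly."""
    return 100.0 * placed_reads / total_reads


if __name__ == "__main__":
    print(round(average_read_length(TOTAL_BASES, TOTAL_READS)))
    print(round(percent_placed(PLACED_READS, TOTAL_READS), 1))
```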

Phusion was used to cluster the reads, and phrap was used for cluster 
assembly and consensus generation.

Small supercontigs with fewer than 3 reads or shorter than 1 kb were 
rejected.
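
The rejection rule can be sketched as a simple filter.  This is an 
illustration only, not the code used for the assembly: the `Supercontig` 
record and its field names are hypothetical, and 1 kb is assumed to mean 
1000 bp.

```python
from dataclasses import dataclass

MIN_READS = 3          # supercontigs with fewer reads are rejected
MIN_LENGTH_BP = 1000   # assumption: "1 kb" means 1000 bp


@dataclass
class Supercontig:
    """Hypothetical record for one supercontig."""
    name: str
    n_reads: int
    length_bp: int


def is_kept(sc):
    """True if the supercontig passes both thresholds."""
    return sc.n_reads >= MIN_READS and sc.length_bp >= MIN_LENGTH_BP


def filter_supercontigs(supercontigs):
    """Return only the supercontigs that are not rejected."""
    return [sc for sc in supercontigs if is_kept(sc)]


if __name__ == "__main__":
    # Made-up example data, not taken from the assembly.
    example = [
        Supercontig("a", n_reads=2, length_bp=5000),   # too few reads
        Supercontig("b", n_reads=10, length_bp=800),   # too short
        Supercontig("c", n_reads=10, length_bp=5000),  # kept
    ]
    print([sc.name for sc in filter_supercontigs(example)])
```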

For the supercontigs (bp measures include estimated gap sizes):

Contig stats:
Total bases          = 1306256104 bp
Contig number        =     430985
Average length       =       3030 bp
Largest              =      44497 bp
bases / contigs: N50 = 4451, n = 87069

Supercontig stats (bp measures include estimated gap sizes):
Total bases          = 1452210772 bp
Supercontigs         =      83470
Average length       =      17398 bp
Largest              =    3581975 bp
bases / contigs: N50 = 296896, n = 1397
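
For readers unfamiliar with the N50 measure, here is a minimal sketch of 
one common way to compute it: sort the sequence lengths from largest to 
smallest and report the length at which the running total first reaches 
half of all bases.  The sketch assumes that the "n" reported above is the 
number of sequences needed to reach that point; the exact convention used 
to produce the figures in this announcement is not stated, and the 
function name is hypothetical.

```python
def n50(lengths):
    """Return (N50 length, number of sequences needed to reach it)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("lengths must be non-empty")


if __name__ == "__main__":
    # Made-up lengths totalling 30: the two largest (10 + 6 = 16)
    # already cover half, so N50 is 6 with n = 2.
    print(n50([2, 3, 4, 5, 6, 10]))
```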

Estimated coverage based on 93 Mbp of 656 finished clones gives:
Supercontig coverage: 95%
Contig coverage:      77%
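
The figures in the contig and supercontig tables above can be 
cross-checked against each other with a little arithmetic.  The sketch 
below recomputes the average lengths, which agree with the reported 
values to within 1 bp.  It also estimates the share of supercontig bases 
made up of gaps, on the assumption that the difference between the two 
totals consists entirely of estimated gap sizes.  All names are 
hypothetical.

```python
# Values copied from the two tables above.
CONTIG_TOTAL_BP = 1_306_256_104
CONTIG_COUNT = 430_985
SUPERCONTIG_TOTAL_BP = 1_452_210_772
SUPERCONTIG_COUNT = 83_470


def average_length(total_bp, count):
    """Mean sequence length in bp."""
    return total_bp / count


def gap_fraction(supercontig_bp, contig_bp):
    """Fraction of supercontig bases not covered by contig sequence.

    Assumption: the whole difference is estimated gap sequence.
    """
    return (supercontig_bp - contig_bp) / supercontig_bp


if __name__ == "__main__":
    print(average_length(CONTIG_TOTAL_BP, CONTIG_COUNT))
    print(average_length(SUPERCONTIG_TOTAL_BP, SUPERCONTIG_COUNT))
    print(gap_fraction(SUPERCONTIG_TOTAL_BP, CONTIG_TOTAL_BP))
```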


-- 
Dr. Kerstin Jekosch                email kj2 at sanger.ac.uk     
Project Leader                     tel   +44 (0)1223 494971
Zebrafish Genome Analysis          fax   +44 (0)1223 494919
Wellcome Trust Sanger Institute           
Hinxton, Cambridge CB10 1SA, UK






