From cavanaug from ncbi.nlm.nih.gov Fri Jan 2 11:51:53 2009
From: cavanaug from ncbi.nlm.nih.gov (Cavanaugh, Mark (NIH/NLM/NCBI) [E])
Date: Fri Jan 2 13:11:01 2009
Subject: [Genbank-bb] GenBank WGS projects : WGS-master records provided as of January 12
Message-ID: <7B6F170840CA6C4DA63EE0C8A7BB43EC03EDF47D@NIHCESMLBX15.nih.gov>

Greetings GenBank Users,

Starting on January 12th of 2009, a new type of data file will be made
available for GenBank WGS (Whole Genome Shotgun) projects, in the WGS
areas of our FTP site.

Since their inception in 2002, WGS projects have had an associated
'WGS-master' record, which summarizes the content of a project. Here is
a link to the master for project ABRT (Philippine tarsier) :

  http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=203287470

And here is an excerpt from that master record:

LOCUS       ABRT010000000        1201173 rc    DNA     linear   PRI 18-NOV-2008
DEFINITION  Tarsius syrichta, whole genome shotgun sequence.
ACCESSION   ABRT000000000
VERSION     ABRT000000000.1  GI:203287470
PROJECT     GenomeProject:20339
KEYWORDS    WGS.
SOURCE      Tarsius syrichta (Philippine tarsier)
....
WGS         ABRT010000001-ABRT011201173
WGS_SCAFLD  GG299110-GG500513
//

This flatfile representation of the ABRT WGS-master does *not* conform
to the specifications for normal GenBank flatfiles. For example:

- It has neither sequence data nor a CONTIG join() statement.

- The 'rc' (record count) value on the LOCUS line represents the number
  of sequence-overlap contig records in the project, rather than a
  basepair count.

- Undocumented linetypes 'WGS' and 'WGS_SCAFLD' exist, which provide the
  ranges of accession numbers for the 1,201,173 sequence-overlap contig
  sequences in the project, and for the 201,404 CON-division records
  that have been constructed from the ABRT01 contigs.

Nonetheless, a WGS-master record has utility because it provides an
overview of many important characteristics of a WGS project, in a
simple and concise way.

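As an illustration of how the 'WGS' and 'WGS_SCAFLD' lines relate to the
record counts quoted above, here is a minimal Python sketch. It is not an
NCBI tool. The function names (accession_range_count, wgs_ranges) are
hypothetical, and the sketch assumes that each linetype carries a single
FIRST-LAST accession range on one line, as in the ABRT excerpt, and that
both accessions in a range share the same alphabetic prefix.

```python
import re

# Assumption: an accession is an uppercase alphabetic prefix followed by digits,
# e.g. ABRT010000001 or GG299110 (as seen in the ABRT master excerpt).
_ACCESSION = re.compile(r"^([A-Z]+)(\d+)$")


def accession_range_count(acc_range):
    """Return the number of accessions in a 'FIRST-LAST' range, inclusive."""
    first, last = acc_range.strip().split("-")
    m_first, m_last = _ACCESSION.match(first), _ACCESSION.match(last)
    if not m_first or not m_last or m_first.group(1) != m_last.group(1):
        raise ValueError("unexpected accession range: %r" % acc_range)
    return int(m_last.group(2)) - int(m_first.group(2)) + 1


def wgs_ranges(flatfile_text):
    """Collect the WGS and WGS_SCAFLD ranges from a master flatfile excerpt.

    Assumes one range per linetype; a later line of the same type
    would overwrite an earlier one in this sketch.
    """
    ranges = {}
    for line in flatfile_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] in ("WGS", "WGS_SCAFLD"):
            ranges[fields[0]] = fields[1]
    return ranges


excerpt = """\
KEYWORDS    WGS.
WGS         ABRT010000001-ABRT011201173
WGS_SCAFLD  GG299110-GG500513
//
"""

for linetype, acc_range in wgs_ranges(excerpt).items():
    print(linetype, acc_range, accession_range_count(acc_range))
```

For the ABRT excerpt, the two ranges work out to 1,201,173 contig records
and 201,404 CON-division records, which agrees with the 'rc' value on the
LOCUS line and with the counts given above.
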
The ASN.1 version of WGS-master records will be placed in:

  ftp://ftp.ncbi.nih.gov/ncbi-asn1/wgs

and the file naming convention will be:

  wgs.XXXX.mstr.bse.gz

These files will contain a gzip-compressed, binary ASN.1 Seq-entry
value. 'XXXX' represents a four-character WGS Project Code, such as
ABYH.

The GenBank flatfile representation of WGS-master records will be
placed in:

  ftp://ftp.ncbi.nih.gov/genbank/wgs

and the file naming convention will be:

  wgs.XXXX.mstr.gbff.gz

Here is an example of the filenames that one would encounter for the
ABYH project in the /genbank/wgs area, as of January 12:

  wgs.ABYH.1.gbff.gz
  wgs.ABYH.1.gnp.gz
  wgs.ABYH.1.qscore.gz
  wgs.ABYH.mstr.gbff.gz

If you process the GenBank flatfile representation of WGS projects, and
you are *not* interested in WGS-masters, you may need to add a
filtration step to remove the master files from automated FTP transfers
(due to similarities in filename patterns).

Mark Cavanaugh
GenBank
NCBI/NLM/NIH/HHS
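
One possible shape for the filtration step mentioned above is sketched
here in Python. It is only an illustration, not an NCBI-provided script.
The names MASTER_RE, is_master_file and split_masters are hypothetical,
and the pattern assumes that a WGS Project Code is four uppercase letters
(as in ABYH); adjust it if the codes you encounter differ.

```python
import re

# Matches wgs.XXXX.mstr.gbff.gz and wgs.XXXX.mstr.bse.gz.
# Assumption: XXXX is exactly four uppercase letters, as in the ABYH example.
MASTER_RE = re.compile(r"^wgs\.[A-Z]{4}\.mstr\.(gbff|bse)\.gz$")


def is_master_file(filename):
    """Return True if the filename follows the WGS-master naming convention."""
    return MASTER_RE.match(filename) is not None


def split_masters(filenames):
    """Split a directory listing into (master files, all other files)."""
    masters = [name for name in filenames if is_master_file(name)]
    others = [name for name in filenames if not is_master_file(name)]
    return masters, others


listing = [
    "wgs.ABYH.1.gbff.gz",
    "wgs.ABYH.1.gnp.gz",
    "wgs.ABYH.1.qscore.gz",
    "wgs.ABYH.mstr.gbff.gz",
]

masters, others = split_masters(listing)
print("skip:", masters)
print("transfer:", others)
```

A transfer script that previously selected files with a loose pattern
such as wgs.*.gbff.gz could apply is_master_file to each name in the
listing and skip the matches before downloading.
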