[Genbank-bb] GenBank WGS projects : WGS-master records provided as of January 12

Cavanaugh, Mark (NIH/NLM/NCBI) [E] via genbankb@net.bio.net (by cavanaug from ncbi.nlm.nih.gov)
Fri Jan 2 11:51:53 EST 2009


Greetings GenBank Users,

Starting on January 12th of 2009, a new type of data file will be
made available for GenBank WGS (Whole Genome Shotgun) projects,
in the WGS areas of our FTP site.

Since their inception in 2002, WGS projects have had an associated
'WGS-master' record, which summarizes the content of a project. Here
is a link to the master record for project ABRT (Philippine tarsier):

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=203287470

And here is an excerpt from that master record:

LOCUS       ABRT010000000        1201173 rc    DNA     linear   PRI 18-NOV-2008
DEFINITION  Tarsius syrichta, whole genome shotgun sequence.
ACCESSION   ABRT000000000
VERSION     ABRT000000000.1  GI:203287470
PROJECT     GenomeProject:20339
KEYWORDS    WGS.
SOURCE      Tarsius syrichta (Philippine tarsier)
....
WGS         ABRT010000001-ABRT011201173
WGS_SCAFLD  GG299110-GG500513
//


This flatfile representation of the ABRT WGS-master does *not*
conform to the specifications for normal GenBank flatfiles.
For example:

- It has neither sequence data nor a CONTIG join() statement.

- The 'rc' (record count) value on the LOCUS line represents the
  number of sequence-overlap contig records in the project, rather
  than a basepair count.

- It contains the undocumented linetypes 'WGS' and 'WGS_SCAFLD',
  which provide the ranges of accession numbers for the 1,201,173
  sequence-overlap contig sequences in the project, and for
  the 201,404 CON-division records that have been constructed
  from the ABRT01 contigs.
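
As an illustration of how the 'WGS' and 'WGS_SCAFLD' linetypes relate
to the counts quoted above, here is a minimal Python sketch. It is not
an NCBI tool. The function names are hypothetical, and it assumes that
each of these lines carries a single 'FIRST-LAST' range whose two
accessions share the same alphabetic prefix, as in the ABRT excerpt.

```python
import re

# Assumption: an accession is an alphabetic prefix followed by digits,
# e.g. 'ABRT010000001' or 'GG299110'.
ACCESSION = re.compile(r"^([A-Z]+)(\d+)$")


def accession_range_count(range_text):
    """Return the number of accessions in an inclusive 'FIRST-LAST' range."""
    first, last = range_text.strip().split("-")
    m_first = ACCESSION.match(first)
    m_last = ACCESSION.match(last)
    if not m_first or not m_last or m_first.group(1) != m_last.group(1):
        raise ValueError("unexpected accession range: %r" % range_text)
    return int(m_last.group(2)) - int(m_first.group(2)) + 1


def parse_master_ranges(flatfile_text):
    """Map each WGS/WGS_SCAFLD linetype in a master record to its count."""
    counts = {}
    for line in flatfile_text.splitlines():
        parts = line.split(None, 1)  # linetype keyword, then the rest
        if len(parts) == 2 and parts[0] in ("WGS", "WGS_SCAFLD"):
            counts[parts[0]] = accession_range_count(parts[1])
    return counts


excerpt = """\
KEYWORDS    WGS.
WGS         ABRT010000001-ABRT011201173
WGS_SCAFLD  GG299110-GG500513
//
"""
print(parse_master_ranges(excerpt))
# {'WGS': 1201173, 'WGS_SCAFLD': 201404}
```

For the ABRT master, the 'WGS' count matches the 'rc' value on the
LOCUS line.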

Nonetheless, a WGS-master record has utility because it provides
an overview of many important characteristics of a WGS project,
in a simple and concise way.

The ASN.1 version of WGS-master records will be placed in:

	ftp://ftp.ncbi.nih.gov/ncbi-asn1/wgs

and the file naming convention will be:

	wgs.XXXX.mstr.bse.gz

These files will contain a gzip-compressed, binary ASN.1 Seq-entry
value. 'XXXX' represents a four-character WGS Project Code, such as
ABYH.

The GenBank flatfile representation of WGS-master records will be
placed in:

	ftp://ftp.ncbi.nih.gov/genbank/wgs

and the file naming convention will be:

	wgs.XXXX.mstr.gbff.gz
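
For those who script their downloads, the two naming conventions above
can be sketched in Python as follows. The function name is
hypothetical, and the check that a Project Code is four uppercase
letters is an assumption based on the examples in this message (ABRT,
ABYH), not a documented rule.

```python
import re

# Directories as given in this announcement.
ASN1_DIR = "ftp://ftp.ncbi.nih.gov/ncbi-asn1/wgs"
FLATFILE_DIR = "ftp://ftp.ncbi.nih.gov/genbank/wgs"


def master_file_urls(project_code):
    """Build the expected WGS-master file URLs for one WGS Project Code."""
    # Assumption: codes are four uppercase letters, such as 'ABYH'.
    if not re.fullmatch(r"[A-Z]{4}", project_code):
        raise ValueError("unexpected WGS Project Code: %r" % project_code)
    return {
        "asn1": "%s/wgs.%s.mstr.bse.gz" % (ASN1_DIR, project_code),
        "flatfile": "%s/wgs.%s.mstr.gbff.gz" % (FLATFILE_DIR, project_code),
    }


for kind, url in master_file_urls("ABYH").items():
    print(kind, url)
```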

Here is an example of the filenames that one would encounter for
the ABYH project in the /genbank/wgs area, as of January 12:

	wgs.ABYH.1.gbff.gz
	wgs.ABYH.1.gnp.gz
	wgs.ABYH.1.qscore.gz
	wgs.ABYH.mstr.gbff.gz

If you process the GenBank flatfile representation of WGS projects,
and you are *not* interested in WGS-masters, you may need to add
a filtration step to remove the master files from automated FTP
transfers (due to similarities in filename patterns).
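
One possible form of such a filtering step is sketched below in
Python. The function name and the regular expression are illustrative
assumptions. The pattern presumes that master files always carry
'.mstr.' directly after a four-letter Project Code, as in the ABYH
example above.

```python
import re

# Assumption: master files look like 'wgs.XXXX.mstr.<suffix>'.
MASTER_FILE = re.compile(r"^wgs\.[A-Z]{4}\.mstr\.")


def drop_master_files(filenames):
    """Return the filenames that are not WGS-master files."""
    return [name for name in filenames if not MASTER_FILE.match(name)]


listing = [
    "wgs.ABYH.1.gbff.gz",
    "wgs.ABYH.1.gnp.gz",
    "wgs.ABYH.1.qscore.gz",
    "wgs.ABYH.mstr.gbff.gz",
]
print(drop_master_files(listing))
# ['wgs.ABYH.1.gbff.gz', 'wgs.ABYH.1.gnp.gz', 'wgs.ABYH.1.qscore.gz']
```

A transfer pattern such as 'wgs.*.gbff.gz' would otherwise match the
master file as well as the numbered flatfiles.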

Mark Cavanaugh
GenBank
NCBI/NLM/NIH/HHS



