[Genbank-bb] GenBank WGS products : Scaffolds for ALWZ02 : new 4+2+"S"+digits accession format

Cavanaugh, Mark (NIH/NLM/NCBI) [E] via genbankb@net.bio.net (by cavanaug from ncbi.nlm.nih.gov)
Wed Dec 4 12:08:26 EST 2013

Greetings GenBank Users,

The release notes for October's GenBank 198.0 included an announcement
of a new accession format for CON-division WGS scaffold records. It 
is included below for your reference.

This new accession format (in which the accessions for WGS scaffolds
are very similar to the accessions of the WGS contigs from which they
are constructed) will initially be used for WGS projects that:

a) Have a very large number of contigs (typically more than 1 million)
b) Have a correspondingly large number of scaffolds
c) Are completely unannotated, at both the contig and scaffold levels

The first WGS project that has these properties is ALWZ02.

The contigs for ALWZ02 have been available on the NCBI FTP site since
mid-June 2013, in the genbank/wgs directory. There are 43 pairs of
GenBank flatfile and nucleotide FASTA files. For example:

-rw-r--r-- 1 gbupdate gbproces 147855124 Jun 27 16:24 wgs.ALWZ.1.fsa_nt.gz
-rw-r--r-- 1 gbupdate gbproces 229952261 Jun 27 16:29 wgs.ALWZ.1.gbff.gz
-rw-r--r-- 1 gbupdate gbproces 121080718 Jun 27 16:28 wgs.ALWZ.43.fsa_nt.gz
-rw-r--r-- 1 gbupdate gbproces 172868580 Jun 27 16:32 wgs.ALWZ.43.gbff.gz

And now, as of December 4, 2013, there is a similar set of files for
the ALWZ02 scaffolds:

-rw-r--r-- 1 gbupdate gbproces 147881072 Dec  4 10:45 wgs.ALWZ.scflds.1.fsa_nt.gz
-rw-r--r-- 1 gbupdate gbproces  28173600 Dec  4 10:52 wgs.ALWZ.scflds.1.gbff.gz
-rw-r--r-- 1 gbupdate gbproces 117159578 Dec  4 10:51 wgs.ALWZ.scflds.48.fsa_nt.gz
-rw-r--r-- 1 gbupdate gbproces   1202661 Dec  4 10:53 wgs.ALWZ.scflds.48.gbff.gz

There are cognate sets of files for the ASN.1 version of the ALWZ WGS
project, in the ncbi-asn1/wgs directory of the NCBI FTP site.
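
For anyone scripting downloads, the naming pattern in the two listings
above can be sketched as follows. This is only an illustration: the
function name wgs_file_names is hypothetical, and the scaffold count of
48 is inferred from the highest-numbered file shown above rather than
stated explicitly.

```python
def wgs_file_names(project, count, scaffolds=False):
    """Return (flatfile, FASTA) file-name pairs numbered 1..count.

    The 'wgs.<project>[.scflds].<n>' pattern is taken from the directory
    listings in this message; it is not an official specification.
    """
    middle = ".scflds" if scaffolds else ""
    names = []
    for n in range(1, count + 1):
        stem = f"wgs.{project}{middle}.{n}"
        names.append((f"{stem}.gbff.gz", f"{stem}.fsa_nt.gz"))
    return names


# 43 contig pairs are stated above; 48 scaffold pairs is an assumption.
contig_files = wgs_file_names("ALWZ", 43)
scaffold_files = wgs_file_names("ALWZ", 48, scaffolds=True)
print(contig_files[0])
print(scaffold_files[-1])
```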

Here's an excerpt of the flatfile for the first ALWZ scaffold, which
illustrates the new accession number format:

LOCUS       ALWZ02S0000001           701 bp    DNA     linear   CON 14-JUN-2013
DEFINITION  Picea glauca scaffold316, whole genome shotgun sequence.
ACCESSION   ALWZ02S0000001 ALWZ000000000
VERSION     ALWZ02S0000001.1
DBLINK      BioProject: PRJNA83435
SOURCE      Picea glauca (white spruce)
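
If you handle these records with your own scripts, the header lines in
the excerpt above can be read with a few lines of Python. This is a
minimal sketch that covers only the fields shown in the excerpt; the
parse_header function is hypothetical and is not a full GenBank
flatfile parser.

```python
def parse_header(text):
    """Map each header keyword (LOCUS, ACCESSION, ...) to its value.

    Continuation lines, which start with whitespace, are skipped for
    brevity; the excerpt used here has none.
    """
    fields = {}
    for line in text.splitlines():
        if not line.strip() or line[0].isspace():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value.strip()
    return fields


record = """\
LOCUS       ALWZ02S0000001           701 bp    DNA     linear   CON 14-JUN-2013
DEFINITION  Picea glauca scaffold316, whole genome shotgun sequence.
ACCESSION   ALWZ02S0000001 ALWZ000000000
VERSION     ALWZ02S0000001.1
DBLINK      BioProject: PRJNA83435
SOURCE      Picea glauca (white spruce)
"""

header = parse_header(record)
# The first ACCESSION token is the scaffold itself; any further tokens
# are secondary accessions (here, the second token).
primary, *secondary = header["ACCESSION"].split()
print(primary, secondary)
```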

So, for WGS projects which meet criteria (a) through (c) above, the
comprehensive WGS FTP areas will now contain data for both contigs
*and* scaffolds. And the scaffold records are making use of the new
accession format.

NOTE: Assembly-Version 03 of the ALWZ WGS project is being processed
now, so all of the ALWZ files at the NCBI FTP site are likely to be 
updated within the next few weeks.


Mark Cavanaugh


1.4 Upcoming Changes

1.4.1 New accession format for CON-division WGS scaffold records

  WGS scaffolds that are constructed from WGS contigs currently
make use of a '2+6' accession number format, with two leading
alphabetic characters followed by six digits. Here is an example
of a WGS-master record that references two different ranges of
scaffold accession numbers:


LOCUS       AABR06000000          112651 rc    DNA     linear   ROD 16-MAR-2012
DEFINITION  Rattus norvegicus strain BN/SsNHsdMCW, whole genome shotgun
            sequencing project.
VERSION     AABR00000000.6  GI:380236478
DBLINK      BioProject: PRJNA10629
SOURCE      Rattus norvegicus (Norway rat)
  ORGANISM  Rattus norvegicus
WGS         AABR06000001-AABR06112651
WGS_SCAFLD  CM000072-CM000092
WGS_SCAFLD  JH612139-JH620698

  Many WGS projects have a large number of chromosome-specific scaffolds
(such as the JH accession range), and a much smaller number of scaffolds
that represent the entirety of the chromosomes (such as the CM
accession range). Because of the former, we are consuming '2+6' prefixes,
like JH, at an unsustainable rate.
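
To make the difference in scale concrete, the sizes of the ranges in the
WGS-master record above can be computed directly. The sketch below is
illustrative only: range_size is a hypothetical helper, and it assumes
that both ends of a range share the same letter prefix and digit count.

```python
import re

# Letters followed by digits on each side of the hyphen, as in the
# WGS and WGS_SCAFLD lines of the example record.
RANGE_RE = re.compile(r"^([A-Z]+)([0-9]+)-([A-Z]+)([0-9]+)$")


def range_size(accession_range):
    """Return the number of accessions covered by 'FIRST-LAST'."""
    match = RANGE_RE.match(accession_range)
    if match is None:
        raise ValueError(f"unrecognized range: {accession_range!r}")
    prefix1, first, prefix2, last = match.groups()
    if prefix1 != prefix2 or len(first) != len(last):
        raise ValueError(f"mismatched range ends: {accession_range!r}")
    return int(last) - int(first) + 1


print(range_size("CM000072-CM000092"))          # 21
print(range_size("JH612139-JH620698"))          # 8560
print(range_size("AABR06000001-AABR06112651"))  # 112651
```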

So we plan to introduce a new accession format for WGS scaffolds which
mirrors the format of the underlying WGS contigs:

  4 letter WGS project code
  2 digit assembly-version number
  "S" (for 'scaffold')
  Six or seven digits

So in the above example, the set of 'JH' scaffolds could make use of
accession numbers such as AABR06S000001 and AABR06S112651:

WGS         AABR06000001-AABR06112651
WGS_SCAFLD  CM000072-CM000092
WGS_SCAFLD  AABR06S000001-AABR06S112651

  We do not currently plan to replace existing '2+6' accessions with
the new '4+2+S+6/7' accessions. However, as of the December 2013
GenBank release, the new format will begin to appear for newly-processed
WGS sequencing projects.
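
Software that validates accession numbers will need to accept the new
format. A regular expression for the '4+2+S+6/7' layout described above
might look like the following sketch. The pattern is derived only from
the description in these notes, and parse_scaffold_accession is a
hypothetical name.

```python
import re

# 4 letters (project code), 2 digits (assembly-version), "S",
# then six or seven digits.
SCAFFOLD_RE = re.compile(r"^([A-Z]{4})([0-9]{2})S([0-9]{6,7})$")


def parse_scaffold_accession(accession):
    """Split a new-format scaffold accession into its parts.

    Returns None for anything that does not match, including '2+6'
    scaffold accessions and WGS contig accessions.
    """
    match = SCAFFOLD_RE.match(accession)
    if match is None:
        return None
    project, version, serial = match.groups()
    return {
        "project": project,
        "assembly_version": int(version),
        "serial": int(serial),
    }


print(parse_scaffold_accession("ALWZ02S0000001"))
print(parse_scaffold_accession("JH612139"))
```

Note that a version suffix such as ".1" is not part of the accession and
would need to be stripped before matching.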

