
[Genbank-bb] GenBank Release 230.0 Available : February 21 2019 : 9:17pm

Cavanaugh, Mark (NIH/NLM/NCBI) [E] via genbankb@net.bio.net (by cavanaug from ncbi.nlm.nih.gov)
Thu Feb 21 21:27:24 EST 2019


Greetings GenBank Users,

  GenBank Release 230.0 is now available via FTP from the National Center
for Biotechnology Information (NCBI):

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 230.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 230.0

 Close-of-data for GenBank 230.0 occurred on 02/17/2019. Uncompressed,
the Release 230.0 flatfiles require roughly 964 GB (sequence files only).
The ASN.1 data require approximately 773 GB.

Recent statistics for 'traditional' sequences (including non-bulk-oriented
TSA, and excluding WGS, bulk-oriented TSA, TLS, and the CON-division):

  Release  Date      Base Pairs    Entries

  229      Dec 2018  285688542186  211281415
  230      Feb 2019  303709510632  212260377   

Recent statistics for WGS sequencing projects:

  Release  Date      Base Pairs     Entries

  229      Dec 2018  3656719423096  773773190
  230      Feb 2019  4164513961679  945019312

Recent statistics for bulk-oriented TSA sequencing projects:

  Release  Date      Base Pairs    Entries

  229      Dec 2018  248592892188  274845473
  230      Feb 2019  263936885705  294772430

Recent statistics for bulk-oriented TLS sequencing projects:

  Release  Date      Base Pairs  Entries

  229      Dec 2018  8511829281  20924588
  230      Feb 2019  9146836085  23259929
  
During the 64 days between the close dates for GenBank Releases 229.0
and 230.0, the 'traditional' portion of GenBank grew by 18,020,968,446
basepairs and 978,962 sequence records. During that same period,
25,301 records were updated. An average of 15,691 'traditional' records
were added and/or updated per day.

  Between releases 229.0 and 230.0, the WGS component of GenBank grew by
507,794,538,583 basepairs and by 171,246,122 sequence records.

  Between releases 229.0 and 230.0, the TSA component of GenBank grew by
15,343,993,517 basepairs and by 19,926,957 sequence records.

  Between releases 229.0 and 230.0, the TLS component of GenBank grew by
635,006,804 basepairs and by 2,335,341 sequence records.
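
  The growth figures above can be re-derived from the statistics tables.
The following is only a minimal sketch in Python: the totals are copied
from the tables in this announcement, while the names (Totals, STATS,
growth, daily_average) are illustrative and not part of any NCBI tool.
Integer division is an assumption that happens to reproduce the quoted
daily average.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Totals:
    base_pairs: int
    entries: int


# (Release 229.0, Release 230.0) totals, copied from the tables above.
STATS = {
    "traditional": (Totals(285688542186, 211281415),
                    Totals(303709510632, 212260377)),
    "WGS": (Totals(3656719423096, 773773190),
            Totals(4164513961679, 945019312)),
    "TSA": (Totals(248592892188, 274845473),
            Totals(263936885705, 294772430)),
    "TLS": (Totals(8511829281, 20924588),
            Totals(9146836085, 23259929)),
}


def growth(component):
    """Return (added base pairs, added records) between the two releases."""
    old, new = STATS[component]
    return new.base_pairs - old.base_pairs, new.entries - old.entries


def daily_average(added, updated, days):
    """Records added and/or updated per day, truncated to a whole number."""
    return (added + updated) // days


if __name__ == "__main__":
    for name in STATS:
        base_pairs, records = growth(name)
        print(f"{name:12s} +{base_pairs:,} bp  +{records:,} records")
    added_records = growth("traditional")[1]
    print("per day:", daily_average(added_records, 25301, 64))
```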

  For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 230.0 and Upcoming Changes) have been appended
below for your convenience.

                * * * Important Notice * * *

  Section 1.3.2 of the GenBank release notes describes changes to accession
formats for traditional nucleotide sequences, WGS/TSA/TLS sequences, and
for protein sequences. Several of the new formats are now in production use
for some classes of WGS projects. These important changes are likely
to be of interest to many GenBank users, and we encourage a review of
the section.

  Release 230.0 data, and subsequent updates, are available now via
NCBI's Entrez and BLAST services.

  As a general guideline, we suggest first transferring the GenBank
release notes (gbrel.txt) whenever a release is being obtained. Check
to make sure that the date and release number in the header of the
release notes are current (eg: February 15 2019, 230.0). If they are
not, interrupt the remaining transfers and then request assistance from
the NCBI Service Desk.

  A comprehensive check of the headers of all release files after your
transfers are complete is also suggested. Here's how one might go about
this on a Unix or Linux platform, using csh/tcsh:

        set files = `ls gb*.*`
        foreach i ($files)
                head -10 $i | grep Release
        end

Or, if the files are compressed, the body of the loop might instead be:

        gzip -dc $i | head -10 | grep Release
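
  For sites that prefer a script over an interactive shell loop, the same
header check could be sketched in Python. This is an illustration, not an
NCBI-provided tool: the function names are hypothetical, the default of
"Release 230.0" is simply the release announced here, and files are
assumed to be compressed only when their names end in ".gz".

```python
import glob
import gzip


def release_lines(path, max_lines=10):
    """Return lines mentioning 'Release' among the first max_lines of a file."""
    opener = gzip.open if path.endswith(".gz") else open
    hits = []
    with opener(path, "rt", encoding="ascii", errors="replace") as handle:
        for number, line in enumerate(handle):
            if number >= max_lines:
                break
            if "Release" in line:
                hits.append(line.strip())
    return hits


def check_release(paths, expected="Release 230.0"):
    """Return the paths whose header does not mention the expected release."""
    return [path for path in paths
            if not any(expected in line for line in release_lines(path))]


if __name__ == "__main__":
    # Same file pattern as the csh example above.
    for path in check_release(sorted(glob.glob("gb*.*"))):
        print("unexpected or missing release header:", path)
```

Only the first ten lines of each file are read, so the check stays quick
even for the larger 500MB files.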

  If you encounter problems while ftp'ing or uncompressing Release
230.0, please send email outlining your difficulties to:

        info from ncbi.nlm.nih.gov

Mark Cavanaugh, Michael Kimelman, Ilya Dondoshansky
GenBank
NCBI/NLM/NIH/HHS


1.3 Important Changes in Release 230.0

1.3.1 Organizational changes

1.3.1.a Large decrease in the number of sequence files

  The number of sequence files for GenBank 230.0 has decreased significantly,
from 3,287 files to 2,272 files.

  This is a result of changes in how EST and GSS records are stored at
NCBI, which affect both Entrez and the organization of GenBank Release
files. For information about the Entrez impacts, please see:

https://ncbiinsights.ncbi.nlm.nih.gov/2018/07/30/upcoming-changes-est-gss-databases/

  The GenBank Release files are impacted because we increased the targeted
uncompressed file size for *all* of the sequence data files to 500MB, to match
the target used by EST and GSS for many years.

  The overall effect is that this release has fewer files, and each file is
larger than before. For example, there were 566 BCT-division GenBank flatfiles
for GenBank 229.0, most of which were about 250MB (uncompressed). For GenBank
230.0 the number of BCT files is 324, most of which are about 500MB
(uncompressed).

  However, there were two issues that caused a net *increase* in the number of
files for the SYN and EST divisions.

  A Jan 2019 submission of 57 synthetic chromosomal constructs ranging from
35Mbp to 271Mbp nearly tripled the number of SYN files. Unfortunately, the
size of these records limits some SYN files to only 1 to 4 records apiece.
The gbsyn2.seq data file is a good example, containing just this record:

LOCUS       CP034487           271050050 bp    DNA     linear   SYN 08-JAN-2019
DEFINITION  Eukaryotic synthetic construct chromosome 1.
ACCESSION   CP034487
VERSION     CP034487.1
DBLINK      BioProject: PRJNA504496
            BioSample: SAMN10411725
KEYWORDS    .
SOURCE      eukaryotic synthetic construct
  ORGANISM  eukaryotic synthetic construct
            other sequences; artificial sequences.
REFERENCE   1  (bases 1 to 271050050)
  AUTHORS   Fu,S., Wang,A. and Au,K.F.
  TITLE     A comparative evaluation of hybrid error correction methods for
            error-prone long reads
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 271050050)
  AUTHORS   Fu,S., Wang,A. and Au,K.F.
  TITLE     Direct Submission
  JOURNAL   Submitted (11-DEC-2018) Internal medicine, University of Iowa, 285
            Newton Road, Iowa City, IA 52242, USA
COMMENT     This genome was generated by simulation. We randomly generated a
            set of SNPs and modified the human hg38 reference genome. We used
            the the simulated genome to test the performance of error
            correction tools on  heterozygous sequences.

  For the EST division, in spite of attempting to achieve 500MB uncompressed
per file, 93 of the gbest*.seq files are less than half of that target size.
Some of the files are dramatically less, such as gbest302.seq at only 5MB.
So ironically, the number of EST division files has increased from 489 to 574
for GenBank 230.0. We will work to address the EST file-size uniformity problem
in future releases.

  Note: The uncompressed file-size target for GenBank flatfiles is periodically
increased due to database growth. It is likely that the target will increase
to 750MB within the next year or two.

1.3.1.b Division-specific changes in number of sequence files

The total number of sequence data files decreased by 1,015 with this release:

  - the BCT division is now composed of 324 files (-242)
  - the CON division is now composed of 205 files (-170)
  - the ENV division is now composed of  57 files (-48)
  - the EST division is now composed of 574 files (+85)
  - the GSS division is now composed of 268 files (-40)
  - the HTC division is now composed of   8 files (-7)
  - the HTG division is now composed of  82 files (-73)
  - the INV division is now composed of  68 files (-49)
  - the MAM division is now composed of  32 files (-23)
  - the PAT division is now composed of 193 files (-154)
  - the PHG division is now composed of   3 files (-2)
  - the PLN division is now composed of 143 files (-98)
  - the PRI division is now composed of  33 files (-26)
  - the ROD division is now composed of  17 files (-14)
  - the STS division is now composed of  11 files (-9)
  - the SYN division is now composed of  27 files (+17)
  - the TSA division is now composed of 127 files (-107)
  - the VRL division is now composed of  32 files (-27)
  - the VRT division is now composed of  67 files (-28)
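
  As a cross-check, the per-division changes listed above can be summed to
confirm the net decrease of 1,015 files and to recover the Release 229.0
file counts. This is a small Python sketch: the numbers are copied from
the list, and the names DIVISIONS, net_change and previous_counts are
illustrative only.

```python
# division: (files in Release 230.0, change versus Release 229.0)
DIVISIONS = {
    "BCT": (324, -242), "CON": (205, -170), "ENV": (57, -48),
    "EST": (574, +85), "GSS": (268, -40), "HTC": (8, -7),
    "HTG": (82, -73), "INV": (68, -49), "MAM": (32, -23),
    "PAT": (193, -154), "PHG": (3, -2), "PLN": (143, -98),
    "PRI": (33, -26), "ROD": (17, -14), "STS": (11, -9),
    "SYN": (27, +17), "TSA": (127, -107), "VRL": (32, -27),
    "VRT": (67, -28),
}


def net_change(divisions):
    """Sum the per-division changes in file count."""
    return sum(delta for _, delta in divisions.values())


def previous_counts(divisions):
    """Recover each division's file count in the previous release."""
    return {name: now - delta for name, (now, delta) in divisions.items()}


if __name__ == "__main__":
    print("net change:", net_change(DIVISIONS))
    for name, count in sorted(previous_counts(DIVISIONS).items()):
        print(f"{name}: {count} files in Release 229.0")
```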

1.3.2 Expanded accession number formats now in use for WGS projects

  As of the February 2019 GenBank 230.0 release, several of the
long-planned expanded accession number formats are in use for
sequences in some WGS projects.

  Previously, the accession format used for Whole Genome Shotgun (WGS),
Transcriptome Shotgun Assembly (TSA), and Targeted Locus Study (TLS)
sequencing projects consisted of a four-letter Project Code prefix,
a two-digit Assembly-Version number, followed by 6, 7, or 8 digits
(depending on the number of sequences in the project).

  This format has been expanded to a six-letter Project Code prefix,
two-digit Assembly-Version number, followed by 7, 8, or 9 digits.

  Protein sequences have made use of a "3+5" accession format, consisting
of a three-letter prefix followed by five digits. This format has been
expanded to make use of seven digits, allowing exhausted protein accession
ranges such as EAA00001-EZZ99999 to be re-opened, as EAA0000001-EZZ9999999.
  
  Project AAAABB provides a good example of both of these changes:

LOCUS       AAAABB010000001       100500 bp    DNA     linear   BCT 30-JAN-2019
DEFINITION  Clostridioides difficile isolate CD24
            SAMN10715316-rid7006163.denovo.001, whole genome shotgun sequence.
ACCESSION   AAAABB010000001 AAAABB010000000
VERSION     AAAABB010000001.1
DBLINK      BioProject: PRJNA278886
            BioSample: SAMN10715316
            Sequence Read Archive: SRR8419280
KEYWORDS    WGS; GMI.
....
     CDS             39..485
                     /gene="lspA"
                     /locus_tag="EPE87_00005"
                     /EC_number="3.4.23.36"
                     /inference="COORDINATES: similar to AA
                     sequence:RefSeq:YP_001089114.1"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Protein Homology."
                     /codon_start=1
                     /transl_table=11
                     /product="signal peptidase II"
                     /protein_id="EAA0000001.1"
                     /translation="MLYILIIILLIGLDQLSKIWVLNNLVDVSTIPIINNVFHLTYVE
                     NRGAAFGLLQNNQWIFIIVALLATVFGLYYLNTRKVHIFGRLGIILIISGALGNLVDR
                     VRLGFVVDYFDFRVIWEYVFNVADVFVVVGTVFLCIYVLFFESKSR"

  Accession numbers in the old formats will continue to be assigned
for some classes of GenBank submissions, until the available 4-letter
project code prefixes and the available 3+5 protein accession ranges
have been exhausted.

  Non-WGS/TLS/TSA nucleotide sequences currently make use of a "2+6"
accession format, consisting of a two-letter prefix followed by six digits.
This format will be expanded to make use of eight digits, allowing accession
ranges that have been exhausted (eg: JG000001-JG999999) to be "re-opened"
(eg: JG00000001-JG99999999). No nucleotide sequence records have yet been
assigned a "2+8" accession number, but this will certainly occur within
the next few months.
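
  Software that validates accession numbers with fixed-width patterns may
need updating. The sketch below, in Python, shows one way the formats
described in this section could be recognized. The labels and the
classify function are hypothetical, and the patterns cover only the
formats described here, so any other accession format returns None.

```python
import re

# (label, pattern) pairs for the formats described in section 1.3.2.
FORMATS = [
    ("WGS/TSA/TLS 4+2+(6-8)", re.compile(r"[A-Z]{4}\d{2}\d{6,8}")),
    ("WGS/TSA/TLS 6+2+(7-9)", re.compile(r"[A-Z]{6}\d{2}\d{7,9}")),
    ("protein 3+5", re.compile(r"[A-Z]{3}\d{5}")),
    ("protein 3+7", re.compile(r"[A-Z]{3}\d{7}")),
    ("nucleotide 2+6", re.compile(r"[A-Z]{2}\d{6}")),
    ("nucleotide 2+8", re.compile(r"[A-Z]{2}\d{8}")),
]


def classify(accession):
    """Return the label of the matching format, or None if none matches."""
    # Drop a trailing version suffix such as ".1" before matching.
    base = accession.split(".", 1)[0].upper()
    for label, pattern in FORMATS:
        if pattern.fullmatch(base):
            return label
    return None


if __name__ == "__main__":
    for acc in ("AAAABB010000001.1", "EAA0000001.1", "CP034487", "JG00000001"):
        print(acc, "->", classify(acc))
```

Because the prefixes differ in length (two, three, four, or six letters),
no accession can match more than one of these patterns.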

1.3.3 GSS File Header Problem : Resolved

  GSS sequences at GenBank used to be maintained in two different systems,
depending on their origin. This caused a discrepancy between the filenames
and file headers of 130 GSS flatfiles, because the dumps from each system
did not know how many files were being dumped by the other. For example,
the header for gbgss179.seq in GenBank 229.0 was:

GBGSS1.SEQ          Genetic Sequence Data Bank
                        December 15 2018

                NCBI-GenBank Flat File Release 229.0

                           GSS Sequences (Part 1)

   87375 loci,    64103840 bases, from    87375 reported sequences

  The filename and part number in the header both used "1", though the
file had been renamed as "179" based on the number of files dumped from the
other system. Files gbgss179.seq.gz through gbgss308.seq.gz were affected
in GenBank 229.0. However, as of GenBank 230.0, GSS sequences are
stored in a single system at NCBI, so this file header problem no longer
occurs.
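
  Users who archive older releases may still want to detect this kind of
mismatch. A minimal sketch in Python follows. It assumes, based only on
the example header above, that the first token of the header's first line
is the uppercased filename. The function name is illustrative.

```python
import os


def header_name_matches(path, header_first_line):
    """Compare a file's name with the filename given in its header line."""
    expected = os.path.basename(path)
    if expected.endswith(".gz"):
        expected = expected[:-3]  # compare against the uncompressed name
    tokens = header_first_line.split()
    return bool(tokens) and tokens[0].lower() == expected.lower()


if __name__ == "__main__":
    header = "GBGSS1.SEQ          Genetic Sequence Data Bank"
    print(header_name_matches("gbgss179.seq.gz", header))
    print(header_name_matches("gbgss1.seq", header))
```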

1.4 Upcoming Changes

  No GenBank release changes are planned for the next four months.
  


