[Genbank-bb] GenBank Release 219.0 Available : April 19 2017

Cavanaugh, Mark (NIH/NLM/NCBI) [E] via genbankb@net.bio.net (by cavanaug from ncbi.nlm.nih.gov)
Wed Apr 19 22:30:21 EST 2017


Greetings GenBank Users,

  GenBank Release 219.0 is now available via FTP from the National Center
for Biotechnology Information (NCBI):

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 219.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 219.0

 Close-of-data for GenBank 219.0 occurred on 04/14/2017. Uncompressed,
the Release 219.0 flatfiles require roughly 818 GB (sequence files only).
The ASN.1 data require approximately 685 GB.

Recent statistics for 'traditional' sequences (including non-bulk-oriented
TSA, and excluding WGS, bulk-oriented TSA, TLS, and the CON-division):

  Release  Date      Base Pairs    Entries

  218      Feb 2017  228719437638  199341377
  219      Apr 2017  231824951552  200877884

Recent statistics for WGS sequencing projects:

  Release  Date      Base Pairs     Entries

  218      Feb 2017  1892966308635  409490397
  219      Apr 2017  2035032639807  451840147

Recent statistics for bulk-oriented TSA sequencing projects:

  Release  Date      Base Pairs    Entries

  218      Feb 2017  133517212104  151431485
  219      Apr 2017  149038907599  165068542

Recent statistics for bulk-oriented TLS sequencing projects:

  Release  Date      Base Pairs  Entries

  218      Feb 2017  636923295   1438349
  219      Apr 2017  636923295   1438349  (unchanged)

  During the 60 days between the close dates for GenBank Releases 218.0
and 219.0, the 'traditional' portion of GenBank grew by 3,105,513,914
basepairs and by 1,536,507 sequence records. During that same period,
173,862 records were updated. An average of 28,506 'traditional' records
were added and/or updated per day.

  Between releases 218.0 and 219.0, the WGS component of GenBank grew by
142,066,331,172 basepairs and by 42,349,750 sequence records.

  Between releases 218.0 and 219.0, the TSA component of GenBank grew by
15,521,695,495 basepairs and by 13,637,057 sequence records.

  Between releases 218.0 and 219.0, the TLS component of GenBank was
unchanged.
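
  The growth figures quoted above can be re-derived from the statistics
tables. The following is a minimal sketch of that arithmetic; the names
STATS, growth and daily_average are illustrative only and are not part of
any NCBI tool. The totals, the 173,862 updated records and the 60-day
interval are copied from this announcement.

```python
# Totals copied from the statistics tables: release -> (base pairs, entries).
STATS = {
    "traditional": {218: (228719437638, 199341377), 219: (231824951552, 200877884)},
    "WGS": {218: (1892966308635, 409490397), 219: (2035032639807, 451840147)},
    "TSA": {218: (133517212104, 151431485), 219: (149038907599, 165068542)},
    "TLS": {218: (636923295, 1438349), 219: (636923295, 1438349)},
}


def growth(component, old=218, new=219):
    """Return (base-pair growth, entry growth) between two releases."""
    old_bp, old_entries = STATS[component][old]
    new_bp, new_entries = STATS[component][new]
    return new_bp - old_bp, new_entries - old_entries


def daily_average(added, updated, days):
    """Average number of records added and/or updated per day, rounded."""
    return round((added + updated) / days)


for name in STATS:
    bp, entries = growth(name)
    print(f"{name:12s} +{bp:,} bp  +{entries:,} records")

# Updated-record count and interval length are quoted from the text above.
print(daily_average(growth("traditional")[1], 173862, 60))
```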

  For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 219.0 and Upcoming Changes) have been appended
below for your convenience.

                    * * * IMPORTANT * * *

  The files of this GenBank Release are the first for which integer NCBI
"GI" sequence identifiers are no longer presented in the GenBank, GenPept,
and FASTA sequence formats. Users who rely on GIs need to transition to 
Accession.Version identifiers. The following NCBI News articles may be of
interest:

  https://www.ncbi.nlm.nih.gov/news/12-23-2016-ncbi-insights-bulk-converting-gis/
  https://www.ncbi.nlm.nih.gov/news/12-06-2016-ncbi-insights-convert-gi-accver/
  https://www.ncbi.nlm.nih.gov/news/10-17-2016-gi-numbers-removed/

                    * * * IMPORTANT * * *

  Release 219.0 data, and subsequent updates, are available now via
NCBI's Entrez and Blast services.

  As a general guideline, we suggest first transferring the GenBank
release notes (gbrel.txt) whenever a release is being obtained. Check
to make sure that the date and release number in the header of the
release notes are current (e.g., April 15 2017, 219.0). If they are
not, interrupt the remaining transfers and then request assistance from
the NCBI Service Desk.

  A comprehensive check of the headers of all release files after your
transfers are complete is also suggested. Here's how one might go about
this on a Unix or Linux platform, using csh/tcsh :

        set files = `ls gb*.*`
        foreach i ($files)
                head -10 $i | grep Release
        end

Or, if the files are compressed, perhaps:

        gzcat $i | head -10 | grep Release
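
  For sites that prefer not to depend on csh, the same header check might
be sketched in Python. This is only an illustration: the function names
release_lines and stale_files are hypothetical, and the sketch assumes that
compressed files carry a ".gz" suffix and that the release number appears
within the first ten lines of each file, as in the shell loop above.

```python
import glob
import gzip


def release_lines(path, n=10):
    """Return the lines among the first n of a file that mention 'Release'."""
    # Assumption: gzip-compressed release files end in ".gz".
    opener = gzip.open if path.endswith(".gz") else open
    hits = []
    with opener(path, "rt", encoding="ascii", errors="replace") as handle:
        # zip() stops after n lines without reading the rest of the file.
        for _, line in zip(range(n), handle):
            if "Release" in line:
                hits.append(line.strip())
    return hits


def stale_files(pattern="gb*.*", expected="219.0"):
    """List files whose header does not mention the expected release number."""
    stale = []
    for path in sorted(glob.glob(pattern)):
        if not any(expected in line for line in release_lines(path)):
            stale.append(path)
    return stale

# Example use, run from the directory holding the transferred files:
#     print(stale_files("gb*.*", expected="219.0"))
```

An empty list from stale_files would suggest that every matched file
reports the expected release in its header.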

  If you encounter problems while ftp'ing or uncompressing Release
219.0, please send email outlining your difficulties to:

        info from ncbi.nlm.nih.gov

Mark Cavanaugh, Michael Kimelman, Ilya Dondoshansky, Sergey Zhdanov,
GenBank
NCBI/NLM/NIH/HHS

1.3 Important Changes in Release 219.0

1.3.1 GI sequence identifiers have been removed from GenBank/GenPept/FASTA
      formats and the FASTA header has been simplified

  As of March 15 2017, the integer sequence identifiers known as "GIs" were
no longer included in the GenBank, GenPept, and FASTA formats for GenBank
Update products. The FASTA header has been further simplified, to report only
the sequence Accession.Version for records that originate within the
International Sequence Database Collaboration (INSDC).

  And as of April 15 2017, this GenBank Release 219.0 is the first one to
follow that same policy. 

  Previously-assigned GI sequence identifiers will continue to exist
'behind the scenes', and NCBI services which accept GIs as inputs will
continue to be supported. NCBI will be adding support for Accession.Version
identifiers to any services that currently do not support them. As NCBI
makes this transition, we encourage any users who have workflows that
depend on GIs to make use of Accession.Version identifiers instead.

  The FASTA format has also been changed for sequence records originating
within the INSDC, to report only the Accession.Version and the record title.
This will improve compatibility with other file types provided by NCBI and
others, including GFF3, Gene, and dbSNP download files. This FASTA format
change has already been made for the redesigned genomes FTP site based on
user requests to have a single consistent sequence identifier for both GFF3
and FASTA formats.

  At this time, we plan to continue to provide database source information in
the FASTA header/definition line for non-INSDC sources of sequence data,
including UniProt, PDB structures, PIR, and Patent sequences.

Example 1 : An INSDC nucleotide record

  In the sample record below, nucleotide sequence AF123456 was assigned a
GI of 6633795, and the protein translated from its coding region feature
was assigned a GI of 6633796 :

LOCUS       AF123456                1510 bp    mRNA    linear   VRT 12-APR-2012
DEFINITION  Gallus gallus doublesex and mab-3 related transcription factor 1
            (DMRT1) mRNA, partial cds.
ACCESSION   AF123456
VERSION     AF123456.2  GI:6633795
....
     CDS             <1..936
                     /gene="DMRT1"
                     /note="cDMRT1"
                     /codon_start=1
                     /product="doublesex and mab-3 related transcription factor
                     1"
                     /protein_id="AAF19666.1"
                     /db_xref="GI:6633796"
                     /translation="PAAGKKLPRLPKCARCRNHGYSSPLKGHKRFCMWRDCQCKKCSL
                     IAERQRVMAVQVALRRQQAQEEELGISHPVPLPSAPEPVVKKSSSSSSCLLQDSSSPA
                     HSTSTVAAAAASAPPEGRMLIQDIPSIPSRGHLESTSDLVVDSTYYSSFYQPSLYPYY
                     NNLYNYSQYQMAVATESSSSETGGTFVGSAMKNSLRSLPATYMSSQSGKQWQMKGMEN
                     RHAMSSQYRMCSYYPPTSYLGQGVGSPTCVTQILASEDTPSYSESKARVFSPPSSQDS
                     GLGCLSSSESTKGDLECEPHQEPGAFAVSPVLEGE"

  The Accession.Version is now the sole sequence version identifier. The GI
value on the VERSION line and the GI /db_xref qualifier for the coding region
feature are no longer displayed:

LOCUS       AF123456                1510 bp    mRNA    linear   VRT 12-APR-2012
DEFINITION  Gallus gallus doublesex and mab-3 related transcription factor 1
            (DMRT1) mRNA, partial cds.
ACCESSION   AF123456
VERSION     AF123456.2
....
     CDS             <1..936
                     /gene="DMRT1"
                     /note="cDMRT1"
                     /codon_start=1
                     /product="doublesex and mab-3 related transcription factor
                     1"
                     /protein_id="AAF19666.1"
                     /translation="PAAGKKLPRLPKCARCRNHGYSSPLKGHKRFCMWRDCQCKKCSL
                     IAERQRVMAVQVALRRQQAQEEELGISHPVPLPSAPEPVVKKSSSSSSCLLQDSSSPA
                     HSTSTVAAAAASAPPEGRMLIQDIPSIPSRGHLESTSDLVVDSTYYSSFYQPSLYPYY
                     NNLYNYSQYQMAVATESSSSETGGTFVGSAMKNSLRSLPATYMSSQSGKQWQMKGMEN
                     RHAMSSQYRMCSYYPPTSYLGQGVGSPTCVTQILASEDTPSYSESKARVFSPPSSQDS
                     GLGCLSSSESTKGDLECEPHQEPGAFAVSPVLEGE"

Example 2 : A GenPept record for an INSDC sequence

  The GenPept display format previously included GI identifiers in the VERSION
lines (note that the coding region feature for GenPept has never included any
mention of the protein GI identifiers) :

LOCUS       AAF19666                 311 aa            linear   VRT 12-APR-2012
DEFINITION  doublesex and mab-3 related transcription factor 1, partial [Gallus
            gallus].
ACCESSION   AAF19666
VERSION     AAF19666.1  GI:6633796
DBSOURCE    accession AF123456.2
....
     CDS             1..311
                     /gene="DMRT1"
                     /coded_by="AF123456.2:<1..936"

The VERSION line no longer includes the GI identifier:

LOCUS       AAF19666                 311 aa            linear   VRT 12-APR-2012
DEFINITION  doublesex and mab-3 related transcription factor 1, partial [Gallus
            gallus].
ACCESSION   AAF19666
VERSION     AAF19666.1
DBSOURCE    accession AF123456.2
....
     CDS             1..311
                     /gene="DMRT1"
                     /coded_by="AF123456.2:<1..936"
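
  Parsers that must read both older and newer flatfiles may want to treat
the GI on the VERSION line as optional. The following is a minimal sketch
under that assumption; the regular expression and the function name
parse_version_line are illustrative and are not taken from any NCBI
software.

```python
import re

# Accession.Version, optionally followed by the legacy "GI:<integer>" token.
VERSION_RE = re.compile(
    r"^VERSION\s+(?P<accver>\S+\.\d+)(?:\s+GI:(?P<gi>\d+))?\s*$"
)


def parse_version_line(line):
    """Return (accession_version, gi_or_None) for a flatfile VERSION line."""
    match = VERSION_RE.match(line)
    if match is None:
        raise ValueError(f"not a recognized VERSION line: {line!r}")
    gi = match.group("gi")
    return match.group("accver"), (int(gi) if gi is not None else None)


# Old-style and new-style lines from the examples above.
print(parse_version_line("VERSION     AF123456.2  GI:6633795"))
print(parse_version_line("VERSION     AF123456.2"))
```

Keying downstream tables on the first element of the returned pair lets the
same code handle files from before and after the change.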

Example 3: FASTA format for an INSDC nucleotide and protein sequence

  Previously, the FASTA display for most products included GI and database
source information (eg, 'gb' for GenBank, 'emb' for ENA, 'dbj' for
DDBJ), using the '|' character as a delimiter:

>gi|6633795|gb|AF123456.2| Gallus gallus doublesex and mab-3 related transcription factor 1 (DMRT1) mRNA, partial cds
CCGGCGGCGGGCAAGAAGCTGCCGCGTCTGCCCAAGTGTGCCCGCTGCCGCAACCACGGCTACTCCTCGC
CGCTGAAGGGGCACAAGCGGTTCTGCATGTGGCGGGACTGCCAGTGCAAGAAGTGCAGCCTGATCGCCGA
[....]

>gi|6633796|gb|AAF19666.1| doublesex and mab-3 related transcription factor 1, partial [Gallus gallus]
PAAGKKLPRLPKCARCRNHGYSSPLKGHKRFCMWRDCQCKKCSLIAERQRVMAVQVALRRQQAQEEELGI
SHPVPLPSAPEPVVKKSSSSSSCLLQDSSSPAHSTSTVAAAAASAPPEGRMLIQDIPSIPSRGHLESTSD
LVVDSTYYSSFYQPSLYPYYNNLYNYSQYQMAVATESSSSETGGTFVGSAMKNSLRSLPATYMSSQSGKQ
WQMKGMENRHAMSSQYRMCSYYPPTSYLGQGVGSPTCVTQILASEDTPSYSESKARVFSPPSSQDSGLGC
LSSSESTKGDLECEPHQEPGAFAVSPVLEGE

  Since March 15 2017, and with this April 2017 GenBank Release, only the
Accession.Version is provided as the sequence identifier:

>AF123456.2 Gallus gallus doublesex and mab-3 related transcription factor 1 (DMRT1) mRNA, partial cds
CCGGCGGCGGGCAAGAAGCTGCCGCGTCTGCCCAAGTGTGCCCGCTGCCGCAACCACGGCTACTCCTCGC
CGCTGAAGGGGCACAAGCGGTTCTGCATGTGGCGGGACTGCCAGTGCAAGAAGTGCAGCCTGATCGCCGA
[....]

>AAF19666.1 doublesex and mab-3 related transcription factor 1, partial [Gallus gallus]
PAAGKKLPRLPKCARCRNHGYSSPLKGHKRFCMWRDCQCKKCSLIAERQRVMAVQVALRRQQAQEEELGI
SHPVPLPSAPEPVVKKSSSSSSCLLQDSSSPAHSTSTVAAAAASAPPEGRMLIQDIPSIPSRGHLESTSD
LVVDSTYYSSFYQPSLYPYYNNLYNYSQYQMAVATESSSSETGGTFVGSAMKNSLRSLPATYMSSQSGKQ
WQMKGMENRHAMSSQYRMCSYYPPTSYLGQGVGSPTCVTQILASEDTPSYSESKARVFSPPSSQDSGLGC
LSSSESTKGDLECEPHQEPGAFAVSPVLEGE
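
  Code that extracts identifiers from FASTA definition lines may need to
accept both header styles during the transition. The following is a
minimal sketch, assuming the old headers follow the gi|<GI>|<db>|<acc.ver>|
layout shown in Example 3; the function name accession_version is
hypothetical. Other '|'-delimited layouts (such as those retained for
non-INSDC sources) are deliberately rejected rather than guessed at.

```python
def accession_version(defline):
    """Extract Accession.Version from an old- or new-style FASTA defline."""
    if not defline.startswith(">"):
        raise ValueError("not a FASTA definition line")
    parts = defline[1:].split(None, 1)
    if not parts:
        raise ValueError("definition line has no sequence identifier")
    seq_id = parts[0]
    if "|" not in seq_id:
        # New style: the identifier is already the bare Accession.Version.
        return seq_id
    fields = seq_id.split("|")
    if fields[0] == "gi" and len(fields) >= 4 and fields[3]:
        # Old style: gi|<GI>|<db>|<Accession.Version>|
        return fields[3]
    raise ValueError(f"unsupported identifier layout: {seq_id!r}")


old = ">gi|6633795|gb|AF123456.2| Gallus gallus doublesex and mab-3 related transcription factor 1 (DMRT1) mRNA, partial cds"
new = ">AF123456.2 Gallus gallus doublesex and mab-3 related transcription factor 1 (DMRT1) mRNA, partial cds"
print(accession_version(old), accession_version(new))
```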

Please direct any inquiries about these changes to the NCBI Service Desk:

  info from ncbi.nlm.nih.gov

1.3.2 Organizational changes

  The total number of sequence data files increased by 42 with this release:

  - the BCT division is now composed of 350 files (+20)
  - the CON division is now composed of 359 files (+3)
  - the ENV division is now composed of  97 files (+2)
  - the EST division is now composed of 483 files (+2)
  - the INV division is now composed of 153 files (+1)
  - the PAT division is now composed of 290 files (+7)
  - the PHG division is now composed of   4 files (+1)
  - the PLN division is now composed of 145 files (+2)
  - the PRI division is now composed of  56 files (+1)
  - the SYN division is now composed of  10 files (+1)
  - the TSA division is now composed of 230 files (+1)
  - the VRL division is now composed of  48 files (+1)

1.3.3 Invalid flatfile entry in GenBank 218.0 corrected : KX396599

  A user at Chemical Abstracts Service helpfully reported a formatting error
for GenBank sequence record KX396599 in GenBank Release 218.0. The record
appeared with the KEYWORDS linetype in the wrong column:

LOCUS       KX396599                8308 bp    DNA     linear   PLN 18-JAN-2017
DEFINITION  Marshallia obovata retrotransposon del/tekay, complete sequence.
ACCESSION   KX396599
VERSION     KX396599.1  GI:1131742074

            KEYWORDS.
SOURCE      Marshallia obovata

  This was caused by the presence of an invalid keyword, consisting of just
a period. The data problem was fixed, but time constraints prevented us from
issuing a patch for the affected file ( gbpln130.seq ). Nonetheless, we
appreciate the scrutiny provided by GenBank users, and we do follow up on all
problem reports. Thank you, CAS!
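
  Users who hold a copy of the affected 218.0 file could screen for this
kind of problem with a simple heuristic. The following is a rough sketch
only: the function name misplaced_linetypes is hypothetical, the list of
top-level linetypes is a partial one chosen for illustration, and a
continuation line that happens to begin with one of these words would be
flagged as a false positive.

```python
# Partial list of top-level flatfile linetypes, assumed for illustration.
TOP_LEVEL = (
    "LOCUS", "DEFINITION", "ACCESSION", "VERSION", "KEYWORDS",
    "SOURCE", "REFERENCE", "COMMENT", "FEATURES", "ORIGIN",
)


def misplaced_linetypes(record_lines):
    """Yield (line_number, text) for header linetypes not starting in column 1."""
    for number, line in enumerate(record_lines, start=1):
        if line.startswith("FEATURES"):
            break  # only the record header is examined
        stripped = line.lstrip()
        # Indented text that begins with a top-level linetype is suspicious.
        if line != stripped and stripped.startswith(TOP_LEVEL):
            yield number, line.rstrip()


sample = [
    "LOCUS       KX396599                8308 bp    DNA     linear   PLN 18-JAN-2017",
    "DEFINITION  Marshallia obovata retrotransposon del/tekay, complete sequence.",
    "ACCESSION   KX396599",
    "VERSION     KX396599.1  GI:1131742074",
    "",
    "            KEYWORDS.",
    "SOURCE      Marshallia obovata",
]
for number, text in misplaced_linetypes(sample):
    print(number, text)
```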

1.3.4 GSS File Header Problem

  GSS sequences at GenBank are maintained in two different systems, depending
on their origin, and the dumps from those systems occur in parallel. Because
the second dump (for example) has no prior knowledge of exactly how many GSS
files will be dumped by the first, it does not know how to number its own
output files.

  There is thus a discrepancy between the filenames and file headers for 130
of the GSS flatfiles in Release 219.0. Consider gbgss174.seq :

GBGSS1.SEQ          Genetic Sequence Data Bank
                           April 15 2017

                NCBI-GenBank Flat File Release 219.0

                           GSS Sequences (Part 1)

   87034 loci,    63855245 bases, from    87034 reported sequences

  Here, the filename and part number in the header are "1", though the file
has been renamed as "174" based on the number of files dumped from the other
system.  We hope to resolve this discrepancy at some point, but the priority
is certainly much lower than many other tasks.
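
  Scripts that compare header names with filenames may therefore report
mismatches for GSS files. The following is a minimal sketch of how such a
mismatch might be detected; the function names are hypothetical, and the
patterns assume the gbgssN.seq naming and the GBGSSn.SEQ header form shown
above.

```python
import os
import re


def header_file_number(first_line):
    """Return the number in a header name such as 'GBGSS1.SEQ', else None."""
    match = re.match(r"^GBGSS(\d+)\.SEQ\b", first_line.strip(), re.IGNORECASE)
    return int(match.group(1)) if match else None


def filename_number(path):
    """Return the number in a filename such as 'gbgss174.seq', else None."""
    match = re.match(r"^gbgss(\d+)\.seq", os.path.basename(path), re.IGNORECASE)
    return int(match.group(1)) if match else None


def gss_numbering_mismatch(path, first_line):
    """True when both numbers can be parsed and they differ."""
    from_header = header_file_number(first_line)
    from_name = filename_number(path)
    return None not in (from_header, from_name) and from_header != from_name


header = "GBGSS1.SEQ          Genetic Sequence Data Bank"
print(gss_numbering_mismatch("gbgss174.seq", header))
```

A check like this could be used to exempt the affected GSS files from a
stricter filename-versus-header validation rather than treating them as
corrupt.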

1.4 Upcoming Changes

1.4.1 No changes impacting GenBank Release content are currently planned.




