[Genbank-bb] GenBank Release 185.0 Available : August 17 2011

Cavanaugh, Mark (NIH/NLM/NCBI) [E] via genbankb@net.bio.net (by cavanaug@ncbi.nlm.nih.gov)
Wed Aug 17 20:49:16 EST 2011


Greetings GenBank Users,

  GenBank Release 185.0 is now available via FTP from the 
National Center for Biotechnology Information (NCBI):

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 185.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 185.0

  Close-of-data for GenBank 185.0 occurred on 08/14/2011. Uncompressed,
the Release 185.0 flatfiles require roughly 511 GB (sequence files only)
or 550 GB (including the 'short directory', 'index' and the *.txt
files). The ASN.1 data require approximately 420 GB.

Recent statistics for non-WGS, non-CON sequences:

  Release  Date      Base Pairs    Entries

  184      Jun 2011  129178292958  140482268
  185      Aug 2011  130671233801  142284608

Recent statistics for WGS sequences:

  Release  Date      Base Pairs    Entries

  184      Jun 2011  200487078184   63735078
  185      Aug 2011  208315831132   64997137

  During the 46 days between the close dates for GenBank Releases 184.0
and 185.0, the non-WGS/non-CON portion of GenBank grew by 1,492,940,843
basepairs and by 1,802,340 sequence records. During that same period,
970,764 records were updated. An average of 60,285 non-WGS/non-CON
records were added and/or updated per day.

  Between releases 184.0 and 185.0, the WGS component of GenBank grew by
7,828,752,948 basepairs and by 1,262,059 sequence records.
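
  The growth figures above follow directly from the two statistics tables.
A minimal Python sketch of that arithmetic, using only numbers quoted in
this announcement (the function and variable names are illustrative):

```python
# Totals copied from the statistics tables in this announcement:
# release number -> (base pairs, entries).
NON_WGS = {184: (129178292958, 140482268), 185: (130671233801, 142284608)}
WGS = {184: (200487078184, 63735078), 185: (208315831132, 64997137)}


def growth(stats, old=184, new=185):
    """Return (base pairs added, records added) between two releases."""
    old_bp, old_entries = stats[old]
    new_bp, new_entries = stats[new]
    return new_bp - old_bp, new_entries - old_entries


def daily_average(added, updated, days):
    """Average number of records added and/or updated per day, rounded."""
    return round((added + updated) / days)


non_wgs_bp, non_wgs_records = growth(NON_WGS)
wgs_bp, wgs_records = growth(WGS)
per_day = daily_average(non_wgs_records, updated=970764, days=46)

print(f"non-WGS/non-CON: +{non_wgs_bp:,} bp, +{non_wgs_records:,} records")
print(f"WGS:             +{wgs_bp:,} bp, +{wgs_records:,} records")
print(f"added/updated per day: {per_day:,}")
```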

  For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 185.0 and Upcoming Changes) have been appended
below for your convenience.

                ** Important Notes **

*  GenBank 'index' files are now provided without any EST content, and
   without most GSS content. See Section 1.3.2 of the release notes for
   further details.

   NCBI is considering ceasing support for the index files, so we
   encourage affected users to review that section and provide feedback.

  Release 185.0 data, and subsequent updates, are available now via
NCBI's Entrez and Blast services.

  As a general guideline, we suggest first transferring the GenBank
release notes (gbrel.txt) whenever a release is being obtained. Check
to make sure that the date and release number in the header of the
release notes are current (eg: August 15 2011, 185.0). If they are
not, interrupt the remaining transfers and then request assistance from
the NCBI Service Desk.

  A comprehensive check of the headers of all release files after your
transfers are complete is also suggested. Here's how one might go about
this on a Unix or Linux platform, using csh/tcsh :

        set files = `ls gb*.*`
        foreach i ($files)
                head -10 $i | grep Release
        end

Or, if the files are compressed, perhaps replace the body of the loop with:

        gzcat $i | head -10 | grep Release
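
  For sites without csh, the same header check might be sketched in Python
as follows. This is only an illustration: the function names and the
default expected release are assumptions, and it assumes each file reports
its release on a line containing "Release NNN.N" within its first ten
lines, as the sequence files do.

```python
import gzip
import re
from pathlib import Path

# Matches a header line such as "NCBI-GenBank Flat File Release 185.0".
RELEASE_RE = re.compile(r"Release\s+(\d+\.\d+)")


def header_release(path, max_lines=10):
    """Return the release number found in the first lines of a file, or None."""
    path = Path(path)
    # Compressed files are read through gzip; everything else as plain text.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="ascii", errors="replace") as handle:
        for _, line in zip(range(max_lines), handle):
            match = RELEASE_RE.search(line)
            if match:
                return match.group(1)
    return None


def stale_files(directory, expected="185.0"):
    """List gb*.* files whose header does not report the expected release."""
    return sorted(
        p.name
        for p in Path(directory).glob("gb*.*")
        if p.is_file() and header_release(p) != expected
    )
```

Calling stale_files(".") in the download directory would then return the
names of any files that need to be transferred again.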

  If you encounter problems while ftp'ing or uncompressing Release
185.0, please send email outlining your difficulties to:

        info@ncbi.nlm.nih.gov

Mark Cavanaugh, Michael Kimelman, Ilya Dondoshansky, Sergey Zhdanov
GenBank
NCBI/NLM/NIH/HHS


1.3 Important Changes in Release 185.0

1.3.1 Organizational changes

The total number of sequence data files increased by 19 with this release:

  - the BCT division is now composed of  75 files (+3)
  - the ENV division is now composed of  42 files (+2)
  - the EST division is now composed of 447 files (+2)
  - the GSS division is now composed of 248 files (+1)
  - the HTG division is now composed of 135 files (-1)
  - the INV division is now composed of  31 files (+1)
  - the PAT division is now composed of 168 files (+4)
  - the PLN division is now composed of  50 files (+2)
  - the TSA division is now composed of  35 files (+5)

On rare occasions, the number of HTG files decreases when a significant
number of HTG records 'graduate' to Phase 3, at which point they move to
a non-HTG division.

The total number of 'index' files increased by 2 with this release:

  - the AUT (author name) index is now composed of 89 files (+1)
  - the KEY (keyword)     index is now composed of  6 files (+1)

1.3.2 Changes in the content of index files

  As described in the GB 153 release notes, the 'index' files which accompany
GenBank releases (see Section 3.3) are considered to be a legacy data product by
NCBI, generated mostly for historical reasons. FTP statistics from January 2005
seemed to support this: the index files were transferred only half as frequently as
the files of sequence records. The inherent inefficiencies of the index file
format also lead us to suspect that they have little serious use by the user
community, particularly for EST and GSS records.

  The software that generated the index file products received little
attention over the years, and finally reached its limitations in
February 2006 (Release 152.0). The required multi-server queries which
obtained and sorted many millions of rows of terms from several different
databases simply outgrew the capacity of the hardware used for GenBank
Release generation.

  Our short-term solution is to cease generating some index-file content
for all EST sequence records, and for GSS sequence records that originate
via direct submission to NCBI.

  The three gbacc*.idx index files continue to reflect the entirety of the
release, including all EST and GSS records; however, the file contents are
unsorted.

  These 'solutions' are really just stop-gaps, and we will likely pursue
one of two options:

a) Cease support of the 'index' file products altogether.

b) Provide new products that present some of the most useful data from
   the legacy 'index' files, and cease support for other types of index data.

  If you are a user of the 'index' files associated with GenBank releases, we
encourage you to make your wishes known, either via the GenBank newsgroup,
or via email to NCBI's Service Desk:

   info@ncbi.nlm.nih.gov

  Our apologies for any inconvenience that these changes may cause.

1.3.3 GSS File Header Problem

  GSS sequences at GenBank are maintained in two different systems, depending
on their origin, and the dumps from those systems occur in parallel. Because
the second dump (for example) has no prior knowledge of exactly how many GSS
files will be dumped by the first, it does not know how to number its own
output files.

  There is thus a discrepancy between the filenames and file headers for
103 of the GSS flatfiles in Release 185.0. Consider gbgss146.seq :

GBGSS1.SEQ          Genetic Sequence Data Bank
                          August 15 2011

                NCBI-GenBank Flat File Release 185.0

                           GSS Sequences (Part 1)

   87126 loci,    64015147 bases, from    87126 reported sequences

  Here, the file number and part number in the header are "1", though the file
has been renamed as "146" based on the number of files dumped from the other
system.  We hope to resolve this discrepancy at some point, but the priority
is certainly much lower than many other tasks.
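
  Users whose scripts rely on the header contents may want to detect the
affected files. A minimal Python sketch, assuming the header layout shown
above (file name as the first token, and a "(Part N)" marker); the function
name and the returned dictionary are illustrative only:

```python
import re

PART_RE = re.compile(r"\(Part\s+(\d+)\)")
NUMBER_RE = re.compile(r"(\d+)\.seq$", re.IGNORECASE)


def header_mismatch(filename, header_text):
    """Return None if filename and header agree, else the conflicting values."""
    tokens = header_text.split()
    header_name = tokens[0].lower() if tokens else ""  # e.g. "gbgss1.seq"
    part = PART_RE.search(header_text)                 # number in "(Part N)"
    number = NUMBER_RE.search(filename)                # number in the filename
    names_agree = header_name == filename.lower()
    parts_agree = bool(part and number and part.group(1) == number.group(1))
    if names_agree and parts_agree:
        return None
    return {
        "filename": filename,
        "header_name": header_name,
        "header_part": part.group(1) if part else None,
    }


HEADER = """GBGSS1.SEQ          Genetic Sequence Data Bank
                          August 15 2011

                NCBI-GenBank Flat File Release 185.0

                           GSS Sequences (Part 1)
"""

print(header_mismatch("gbgss146.seq", HEADER))
```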

1.4 Upcoming Changes

1.4.1 Implementation of /whole_replicon qualifier abandoned

  The introduction of a /whole_replicon qualifier was approved by the
International Nucleotide Sequence Database Collaboration during their
annual collaborative meeting in May 2010. However, implementation of the
new qualifier proved more difficult than expected, with a growing and
complex list of conditions under which /whole_replicon would *not* be
appropriate. Rather than continue to define what /whole_replicon is
not intended for, the INSDC has decided to make use of improved submission
processes which allow users to explicitly identify the "genome-level"
molecules (eg, chromosomes) that should be shown in the topmost view
of an organism's genome. Furthermore, given the implementation of
BioProject databases within the INSD, the exchange of project data 
among the INSD members will include provision for indicating, explicitly,
the sequence records which represent "genome-level" molecules. With
these plans in place, it was agreed to abandon plans for the 
/whole_replicon qualifier at the May 2011 INSDC annual meeting.

1.4.2 New centromere and telomere features

  Telomeres and centromeres are essential features of chromosomes and
disrupting their structure affects the viability and life span of an
organism. Centromeric sequence varies from a compact, non-repetitive
region of less than 150 base pairs in S. cerevisiae to a highly repetitive
and complex region of several hundred thousand base pairs in other
eukaryote genomes. The sequence at the telomeric ends is unique compared
to the rest of the chromosome and protects the chromosome ends from
recombination, fusion to other chromosomes or degradation by nucleases.
Currently telomere and centromere features may be under-annotated since
there are no specific feature keys for them, hence the INSDC approved
the creation of two new features at the May 2011 INSDC annual meeting:

Feature Key          centromere
Definition           region of biological interest identified as a centromere
                     and which has been experimentally characterized;

Optional qualifiers  /note="text"
Comment              the centromere feature describes the interval of DNA 
                     that corresponds to a region where chromatids are held 
                     and a kinetochore is formed; 

Feature Key          telomere
Definition           region of biological interest identified as a telomere
                     and which has been experimentally characterized;

Optional qualifiers  /note="text"
                     /rpt_unit_seq
                     /rpt_unit_range
                     /rpt_type
                     /mobile_element
Comment              the telomere feature describes the interval of DNA 
                     that corresponds to a specific structure at the end of   
                     the linear eukaryotic chromosome which is required for 
                     the integrity and maintenance of the end; this region is 
                     unique compared to the rest of the chromosome and
                     represents the physical end of the chromosome;

  These two features are intended for use when the centromere or telomere
has actually been sequenced. These two new features will be legal as of
GenBank Release 186.0 (October 15 2011).
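
  Annotation tools could encode the optional qualifiers from the two
definitions above in a small lookup table. The Python sketch below is
illustrative only: the names are hypothetical, and it assumes that any
qualifier not listed as optional should be flagged, which the preliminary
definitions do not state explicitly.

```python
# Optional qualifiers per feature key, copied from the definitions above.
OPTIONAL_QUALIFIERS = {
    "centromere": {"note"},
    "telomere": {
        "note",
        "rpt_unit_seq",
        "rpt_unit_range",
        "rpt_type",
        "mobile_element",
    },
}


def unexpected_qualifiers(feature_key, qualifiers):
    """Return the qualifier names not listed as optional for the feature key."""
    allowed = OPTIONAL_QUALIFIERS[feature_key]
    return sorted(set(qualifiers) - allowed)


# /rpt_type is listed for telomere but not for centromere.
print(unexpected_qualifiers("telomere", ["note", "rpt_type"]))
print(unexpected_qualifiers("centromere", ["note", "rpt_type"]))
```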

1.4.3 New assembly_gap feature, and /gap_type and /linkage_evidence qualifiers

  Complete genomes are often submitted to the INSDC via a small (or large)
set of independent sequence records, which can be assembled into chromosomes
and/or scaffolds. The CON-division records representing these scaffolds
and chromosomes are usually built using information provided in "AGP files"
provided by the submitter. See:

   http://www.ncbi.nlm.nih.gov/genome/assembly/agp/AGP_Update.shtml

  The AGP 2.0 specification includes provisions for a variety of different
gap types, as well as information about whether a gap between two
scaffold or chromosome components is an unspanned gap or a spanned gap.
There are also biological gap types: telomere, centromere and repeat.
AGP 2.0 also supports terminology to describe the type of evidence used
to establish the linkage connecting the components on either side of a
spanned gap within a scaffold or chromosome. Unfortunately, there is no
mechanism to represent any of this information in the Feature Table.

  To address this, the INSDC has decided to implement an assembly_gap
feature, and /gap_type and /linkage_evidence qualifiers, all of which
will be legal as of October 15 2011 (GenBank Release 186.0). 

  Preliminary definitions of the two new qualifiers are as follows:

Qualifier       /gap_type=
Definition      kind of gap connecting components, or type of biological gap
Value format    "TYPE"
Example         /gap_type="between scaffolds" 
                /gap_type="within scaffold"
Comment         The qualifier is just for gap features. TYPE is a controlled 
                vocabulary:
	
                "between scaffolds"
                "within scaffold"
                "telomere"
                "centromere"
                "short arm"
                "heterochromatin"
                "repeat within scaffold"
                "repeat between scaffolds"


Qualifier       /linkage_evidence=
Definition      kind of evidence establishing linkage across a gap
Value format    "TYPE"
Example         /linkage_evidence="paired_ends"
                /linkage_evidence="within_clone"
Comment         The qualifier is just for gap features of type "within 
                scaffold" or "repeat within scaffold". TYPE is a controlled 
                vocabulary, from the new AGP Specification version 2.0 :

"paired_ends"   - paired sequences from the two ends of a DNA fragment.
"align_genus"   - alignment to a reference genome within the same genus.
"align_xgenus"  - alignment to a reference genome within another genus.
"align_trnscpt" - alignment to a transcript from the same species.
"within_clone"  - sequence on both sides of the gap is derived from the 
                  same clone, but the gap is not spanned by paired-ends. 
                  The adjacent sequence contigs have unknown order and 
                  orientation.
"clone_contig"  - linkage is provided by a clone contig in the tiling path 
                  (TPF). For example, a gap where there is a known clone, but 
                  there is not yet sequence for that clone.
"map"           - linkage asserted using a non-sequence based map such as RH, 
                  linkage, fingerprint or optical.
"strobe"        - strobe sequencing (PacBio).
"unspecified"   - used when converting old AGPs that lack a field for linkage 
                  evidence into the new format.
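
  The two controlled vocabularies, and the rule that /linkage_evidence
applies only to "within scaffold" and "repeat within scaffold" gaps, can be
checked mechanically. A minimal Python sketch based on the preliminary
definitions above; the function name and message wording are illustrative
and not part of any INSDC specification:

```python
GAP_TYPES = {
    "between scaffolds",
    "within scaffold",
    "telomere",
    "centromere",
    "short arm",
    "heterochromatin",
    "repeat within scaffold",
    "repeat between scaffolds",
}

LINKAGE_EVIDENCE = {
    "paired_ends",
    "align_genus",
    "align_xgenus",
    "align_trnscpt",
    "within_clone",
    "clone_contig",
    "map",
    "strobe",
    "unspecified",
}

# Gap types for which /linkage_evidence is meaningful.
LINKED_GAP_TYPES = {"within scaffold", "repeat within scaffold"}


def check_gap(gap_type, linkage_evidence=None):
    """Return a list of problems with a gap's qualifier values (empty if none)."""
    problems = []
    if gap_type not in GAP_TYPES:
        problems.append(f"unknown /gap_type value: {gap_type!r}")
    if linkage_evidence is not None:
        if linkage_evidence not in LINKAGE_EVIDENCE:
            problems.append(
                f"unknown /linkage_evidence value: {linkage_evidence!r}"
            )
        if gap_type not in LINKED_GAP_TYPES:
            problems.append(
                "/linkage_evidence is only for 'within scaffold' "
                "or 'repeat within scaffold' gaps"
            )
    return problems


print(check_gap("within scaffold", "map"))
print(check_gap("telomere", "map"))
```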

  Because there are existing CON-division records with gaps that are not
based on information derived from an AGP file, it was agreed that a new
feature should be introduced that will make use of these new qualifiers:

	assembly_gap

  A complete definition for this feature is not yet available, but we will
inform GenBank users as soon as it is finalized. Both /gap_type and
/linkage_evidence are expected to be mandatory for the assembly_gap feature.

  The new centromere and telomere features (see Section 1.4.2) should 
only be used when the actual sequence of a centromere/telomere has been
determined. If this is not the case, then an assembly_gap feature with
a /gap_type of "centromere" or "telomere" should be used instead.
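
  The choice between the two kinds of annotation can be summarized in a
few lines of Python. This is a sketch of the rule as stated above, not a
final definition: the function name is hypothetical, and only /gap_type is
set for the gap case because the complete assembly_gap definition is not
yet available.

```python
def annotate_region(kind, sequenced):
    """Pick a feature key and qualifiers for a centromere or telomere."""
    if kind not in ("centromere", "telomere"):
        raise ValueError(f"expected 'centromere' or 'telomere', got {kind!r}")
    if sequenced:
        # The actual sequence has been determined: use the new feature key.
        return kind, {}
    # Otherwise the region is represented as a gap of the matching type.
    return "assembly_gap", {"gap_type": kind}


print(annotate_region("telomere", sequenced=True))
print(annotate_region("centromere", sequenced=False))
```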




