GenBank Release 143.0 Now Available

Mark Cavanaugh cavanaug at ncbi.nlm.nih.gov
Thu Aug 19 00:51:37 EST 2004


Greetings GenBank Users,

  GenBank Release 143.0 is now available via ftp from the National
Center for Biotechnology Information (NCBI):

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 143.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 143.0

  Close-of-data was 08/13/2004. Six days were required to build Release 143.0.
Uncompressed, the Release 143.0 flatfiles require approximately 142 GB
(sequence files only) or 160 GB (including the 'short directory' and
'index' files).  The ASN.1 version requires approximately 124 GB. From
the release notes:

   Release  Date       Base Pairs   Entries

   142      Jun 2004   40325321348  35532003
   143      Aug 2004   41808045653  37343937

In the eight-week period between the close dates for GenBank Releases 142.0
and 143.0, the non-WGS portion of GenBank grew by 1,482,724,305 basepairs
and by 1,811,934 sequence records. During that same period, 592,523 records
were updated. Combined, this yields an average of about 42,200 new and/or
updated records per day.

  Between releases 142.0 and 143.0, the WGS component of GenBank grew by
2,535,853,481 basepairs and by 73,883 sequence records.
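
  The non-WGS growth figures quoted above can be re-derived from the two
table rows. The following Python sketch is illustrative only. The length of
the interval in days is not stated in this announcement, so ASSUMED_DAYS is
an assumption, chosen because it reproduces the quoted daily average.

```python
# Totals copied from the release-notes table above.
RELEASES = {
    142: {"base_pairs": 40_325_321_348, "entries": 35_532_003},
    143: {"base_pairs": 41_808_045_653, "entries": 37_343_937},
}
UPDATED_RECORDS = 592_523  # records updated between the two close dates

# ASSUMPTION: the announcement does not give the exact number of days
# between the close dates; 57 reproduces the quoted "about 42,200".
ASSUMED_DAYS = 57


def growth(old, new):
    """Return (base-pair growth, entry growth) between two releases."""
    return (
        RELEASES[new]["base_pairs"] - RELEASES[old]["base_pairs"],
        RELEASES[new]["entries"] - RELEASES[old]["entries"],
    )


bp_growth, new_records = growth(142, 143)
per_day = (new_records + UPDATED_RECORDS) / ASSUMED_DAYS

print(f"{bp_growth:,} basepairs, {new_records:,} new records")
print(f"about {round(per_day, -2):,.0f} new and/or updated records per day")
```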

  We would like to remind our users that GenBank mirrors are available
at ftp://genbank.sdsc.edu/pub and ftp://bio-mirror.net/biomirror/genbank.
Those who experience slow FTP transfers due to a high volume of traffic at
NCBI might realize an improvement in transfer rates from these alternate sites.

  For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 143.0 and Upcoming Changes) have been appended
below.

  Release 143.0 data, and subsequent updates, are available now via
NCBI's Entrez and Blast services.

  If you encounter problems while ftp'ing or uncompressing Release
143.0, please send email outlining your difficulties to
info at ncbi.nlm.nih.gov .

Mark Cavanaugh, Vladimir Alekseyev, Aleksey Vysokolov, Michael Kimelman
GenBank
NCBI/NLM/NIH


1.3 Important Changes in Release 143.0

1.3.1 Organizational changes

  The total number of sequence data files increased by 23 with this release:

  - the BCT division now comprises 10 files  (+1)
  - the EST division now comprises 335 files (+14)
  - the GSS division now comprises 116 files (+6)
  - the HTC division now comprises 6 files   (+2)
  - the HTG division now comprises 61 files  (-1)
  - the VRT division now comprises 7 files   (+1)
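
  As a quick consistency check, the per-division changes listed above can
be summed to confirm the net increase of 23 files. This is a small Python
sketch; the dictionary name is purely illustrative.

```python
# (current file count, change since Release 142.0), copied from the list above.
DIVISION_FILES = {
    "BCT": (10, +1),
    "EST": (335, +14),
    "GSS": (116, +6),
    "HTC": (6, +2),
    "HTG": (61, -1),
    "VRT": (7, +1),
}

# Net change across all divisions that gained or lost files.
net_change = sum(delta for _, delta in DIVISION_FILES.values())

# File counts implied for the previous release.
previous_counts = {
    division: now - delta for division, (now, delta) in DIVISION_FILES.items()
}

print(net_change)
print(previous_counts)
```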


1.3.2 GSS File Header Problem

  GSS sequences at GenBank are maintained in one of two different systems,
depending on their origin. One recent change to release processing involves
the parallelization of the dumps from those systems. Because the second dump
(for example) has no prior knowledge of exactly how many GSS files will be
dumped from the first, it doesn't know how to number its own output files.

  There is thus a discrepancy between the filenames and file headers for
eighteen GSS flatfiles in Release 143.0. Consider the gbgss96.seq file:

GBGSS1.SEQ           Genetic Sequence Data Bank
                            August 15 2004

                NCBI-GenBank Flat File Release 143.0

                           GSS Sequences (Part 1)

   88251 loci,    65633265 bases, from    88251 reported sequences

  Here, the filename and part number in the header both use "1", though the
file has been renamed as "96" based on the files dumped from the other system.

  We will work to resolve this discrepancy in future releases, but its
priority is much lower than that of many other tasks.
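
  Users whose software reads the file header may want to detect the
mismatch described above. The following Python sketch assumes only what the
gbgss96.seq excerpt shows: that the first header line begins with the
uppercase file name. The function names are hypothetical and not part of
any NCBI tool.

```python
import re
from pathlib import PurePath


def header_matches_filename(filename, first_header_line):
    """Compare a flatfile's name with the name embedded in its header.

    ASSUMPTION: the first header line starts with the file name, as in
    "GBGSS1.SEQ           Genetic Sequence Data Bank".
    """
    tokens = first_header_line.split()
    if not tokens:
        return False
    return tokens[0].lower() == PurePath(filename).name.lower()


def part_number(name):
    """Extract the trailing number from a name such as 'gbgss96.seq'."""
    match = re.fullmatch(r"gb[a-z]+(\d+)\.seq", name.lower())
    return int(match.group(1)) if match else None


header = "GBGSS1.SEQ           Genetic Sequence Data Bank"
print(header_matches_filename("gbgss96.seq", header))  # mismatch expected
print(part_number("gbgss96.seq"), part_number(header.split()[0]))
```

  For the affected GSS files, trusting the filename on disk rather than the
header appears to be the safer choice, since the announcement says the files
were renamed after the dump.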

1.4 Upcoming Changes

1.4.1 New qualifier : /old_locus_tag

  The /locus_tag qualifier was introduced in April 2003 to provide
a method for systematically identifying genes, coding regions and
other features which typically result from computational analysis.
This qualifier is often used instead of /gene .

  Sometimes the /locus_tag identifier series supplied by a submitter
of sequence data undergoes a change. Because the original /locus_tag
identifiers might be referenced in journal articles, or in databases,
a means of presenting the original identifiers is needed.

  A new qualifier, /old_locus_tag , will be introduced as of October
2004 for this purpose. A formal description of the qualifier will be
made available via upcoming GenBank release notes, and via the GenBank
newsgroup.

1.4.2 New type of gap() operator

  CON-division records utilize a CONTIG line with a join() statement,
which specifies how sequences can be combined to form a much larger
object. For example:

LOCUS       AE016959            23508449 bp    DNA     linear   CON 12-JUN-2003
DEFINITION  Oryza sativa (japonica cultivar-group) chromosome 10, complete
            sequence.
ACCESSION   AE016959
[....]
CONTIG      join(AE017047.1:1..300029,AE017048.1:61..300029,
            AE017049.1:61..304564,AE017050.1:61..302504,AE017051.1:61..300029,
            AE017052.1:61..300029,AE017053.1:61..303511,AE017054.1:61..302085,
            AE017055.1:61..300029,AE017056.1:61..300029,AE017057.1:61..300029,
            AE017058.1:61..300029,AE017059.1:61..300029,AE017060.1:61..95932,
            gap(30001),AE017061.1:1..300028,AE017062.1:61..306096,
[....]

A gap operator is legal in these join statements:

  gap()  : indicates a gap of unknown length

  gap(N) : where 'N' is a positive integer, indicates a gap with a
           physically-estimated length of 'N' bases.

  In some sequencing projects, a convention is agreed upon by which gaps
of unknown length are all represented by a uniform value, such as 100.

  To capture this usage, a new type of gap operator will be legal as of
October 2004 : 'gap(unkN)', where 'N' is a positive integer. For a gap of
length 100, utilized by convention rather than reflective of the gap's
actual size, the operator would be:

      gap(unk100)

  This new gap operator will make clear the distinction between a
gap with a physically-estimated length, and a gap with a length that
has no actual physical basis. Further details about this new operator
will be made available via these release notes and the GenBank newsgroup.
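
  Software that parses CONTIG lines will need to recognize all three gap
forms. The Python sketch below is illustrative only: it handles just the
element forms shown in this announcement (Accession.Version:X..Y spans,
gap(), gap(N) and gap(unkN)), and the function and field names are
hypothetical.

```python
import re

# One join() element: a gap operator or an Accession.Version:start..end span.
_GAP = re.compile(r"gap\((unk)?(\d*)\)")
_SPAN = re.compile(r"([A-Za-z0-9_]+\.\d+):(\d+)\.\.(\d+)")


def parse_contig(contig):
    """Parse a 'join(...)' statement into a list of span and gap dicts."""
    text = "".join(contig.split())  # drop line breaks and padding
    if not (text.startswith("join(") and text.endswith(")")):
        raise ValueError("expected a join(...) statement")
    elements = []
    for item in text[len("join("):-1].split(","):
        gap = _GAP.fullmatch(item)
        span = _SPAN.fullmatch(item)
        if gap:
            unk, digits = gap.groups()
            if unk and not digits:
                raise ValueError(f"gap(unkN) needs a length: {item}")
            elements.append({
                "type": "gap",
                # None for gap(): the length is unknown
                "length": int(digits) if digits else None,
                # True only for gap(N), a physically-estimated length
                "estimated": bool(digits) and not unk,
            })
        elif span:
            accession, start, end = span.groups()
            elements.append({"type": "span", "accession": accession,
                             "start": int(start), "end": int(end)})
        else:
            raise ValueError(f"unrecognized element: {item}")
    return elements


example = """join(AE017060.1:61..95932,
            gap(30001),AE017061.1:1..300028,gap(unk100),gap())"""
for element in parse_contig(example):
    print(element)
```

  Keeping the length and the "estimated" flag separate lets downstream code
still lay out a gap(unk100) as 100 bases while knowing that the value is a
convention and not a measurement.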

1.4.3 New /compare qualifier

  Four different features exist which can be used to annotate regions
of sequence that are either uncertain or that differ in comparison
to some other sequence:

	variation
	conflict
	misc_difference
	old_sequence

  A /citation qualifier is used to refer to a publication that details
the nature of the uncertain or differing bases. However, a publication
may not always be available (unpublished references), and simply 
referring to a publication is quite indirect.

  The new /compare qualifier will provide a method for directly
referencing a base range on a record that exhibits a sequence
difference:

	/compare="Accession.Version:X..Y"

For example:

	/compare="M10101.1:1..5"

  This new qualifier will be legal as of GenBank Release 144.0 in October of
2004. A formal description of /compare will be made available via upcoming
GenBank release notes, and via the GenBank newsgroup.
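
  Until the formal description is published, the value format shown above
("Accession.Version:X..Y") can be parsed along the following lines. This
Python sketch is based only on that pattern and the single example. The
class and function names are hypothetical, and the accession character set
accepted by the regular expression is an assumption.

```python
import re
from typing import NamedTuple


class CompareRef(NamedTuple):
    accession: str
    version: int
    start: int
    end: int


# ASSUMPTION: accessions consist of letters, digits and underscores.
_COMPARE = re.compile(r'/compare="([A-Za-z0-9_]+)\.(\d+):(\d+)\.\.(\d+)"')


def parse_compare(qualifier):
    """Split a /compare qualifier into accession, version and base range."""
    match = _COMPARE.fullmatch(qualifier.strip())
    if match is None:
        raise ValueError(f"not a /compare qualifier: {qualifier!r}")
    accession, version, start, end = match.groups()
    return CompareRef(accession, int(version), int(start), int(end))


ref = parse_compare('/compare="M10101.1:1..5"')
print(ref.accession, ref.version, ref.start, ref.end)
```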

---


- gttaacaattaaagagtgtttatcgaaattcattatatagtggtttatatagaccacttc
-
- GenBank newsgroup see: http://www.bio.net/hypermail/genbankb/       
- GENBANKB e-mail: messages sent to genbankb at net.bio.net
- subscribe: e-mail biosci-server at net.bio.net with: subscribe genbankb
- unsub: e-mail biosci-server at net.bio.net with: unsubscribe genbankb      
- GenBank on the WWW, see:  http://www.ncbi.nlm.nih.gov/Genbank/
- problems with GENBANKB? E-mail moderator: francis at bioinformatics.ubc.ca                  




