GenBank Release 144.0 Now Available

Mark Cavanaugh cavanaug at ncbi.nlm.nih.gov
Wed Oct 20 02:29:29 EST 2004


Greetings GenBank Users,

  GenBank Release 144.0 is now available via ftp from the National
Center for Biotechnology Information (NCBI):

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 144.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 144.0

  Close-of-data was October 13 2004. Five business days were required to build
Release 144.0. Uncompressed, the Release 144.0 flatfiles require approximately
147 GB (sequence files only) or 166 GB (including the 'short directory' and
'index' files).  The ASN.1 version requires approximately 128 GB. From
the release notes:

   Release  Date       Base Pairs   Entries

   143      Aug 2004   41808045653  37343937
   144      Oct 2004   43194602655  38941263

In the eight-week period between the close dates for GenBank Releases 143.0
and 144.0, the non-WGS portion of GenBank grew by 1,386,557,002 basepairs
and by 1,597,326 sequence records. During that same period, 349,631 records
were updated. Combined, this yields an average of about 31,900 new and/or
updated records per day.
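
  The growth figures above can be re-derived from the release table. The
following Python lines are only an illustrative sketch; the variable names
are arbitrary, and the number of days is not taken from the release notes
but is back-calculated from the quoted daily average.

```python
# Totals copied from the release table above.
releases = {
    143: {"base_pairs": 41808045653, "entries": 37343937},
    144: {"base_pairs": 43194602655, "entries": 38941263},
}
updated_records = 349631  # records updated between the two close dates

bp_growth = releases[144]["base_pairs"] - releases[143]["base_pairs"]
entry_growth = releases[144]["entries"] - releases[143]["entries"]
new_or_updated = entry_growth + updated_records

# Interval implied by the quoted average of about 31,900 records per day.
implied_days = new_or_updated / 31900

print(f"{bp_growth:,} basepairs, {entry_growth:,} new records")
print(f"{new_or_updated:,} new and/or updated records, ~{implied_days:.0f} days")
```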

  Between releases 143.0 and 144.0, the WGS component of GenBank grew by
2,742,978,532 basepairs and by 857,503 sequence records.

           * * * Important * * * 

  The GenBank mirror located at ftp://genbank.sdsc.edu/pub will be out of
service for several weeks. Users should not use the SDSC mirror for GenBank 144.0.
The alternate mirror at ftp://bio-mirror.net/biomirror/genbank remains available.

  As a general guideline, we suggest first transferring the GenBank release
notes (gbrel.txt) whenever a release is being obtained. Check to make sure
that the date and release number in the header of the release notes are current
(October 15 2004, 144.0). If they are not, interrupt the remaining transfers and
then request assistance from the NCBI Service Desk.

  For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 144.0 and Upcoming Changes) have been appended
below.

  Release 144.0 data, and subsequent updates, are available now via
NCBI's Entrez and Blast services.

  If you encounter problems while ftp'ing or uncompressing Release
144.0, please send email outlining your difficulties to
info at ncbi.nlm.nih.gov .

Mark Cavanaugh, Vladimir Alekseyev, Aleksey Vysokolov, Michael Kimelman
GenBank
NCBI/NLM/NIH

1.3 Important Changes in Release 144.0

1.3.1 Organizational changes

  The total number of sequence data files increased by 24 with this release:

  - the EST division is now comprised of 349 files (+14)
  - the GSS division is now comprised of 120 files (+4)
  - the HTG division is now comprised of  62 files (+1)
  - the INV division is now comprised of   7 files (+1)
  - the PAT division is now comprised of  16 files (+1)
  - the PLN division is now comprised of  13 files (+1)
  - the ROD division is now comprised of  14 files (+1)

  In addition, the MAM division has been newly split into two files,
  gbmam1.seq and gbmam2.seq (+1).

1.3.2 New qualifier : /old_locus_tag

  The /locus_tag qualifier was introduced in April 2003 to provide
a method for systematically identifying genes, coding regions and
other features which typically result from computational analysis.
This qualifier is often used instead of /gene .

  Sometimes the /locus_tag identifier series supplied by a submitter
of sequence data undergoes a change. Because the original /locus_tag
identifiers might be referenced in journal articles, or in databases,
a means of presenting the original identifiers is needed.

  So a new qualifier, /old_locus_tag , has been introduced as of this
October 2004 release :

Qualifier       /old_locus_tag
Definition      feature tag assigned for tracking purposes 
Value Format    "text" (single token)
Example         /old_locus_tag="RSc0382"
                /locus_tag="YPO0002"
Comment         /old_locus_tag can be used with any feature where /gene is valid and 
                where a /locus_tag qualifier is present.  
                Identical /old_locus_tag values may be used within an entry/record, 
                but only if the identical /old_locus_tag values are associated 
                with the same gene; in all other circumstances the /old_locus_tag 
                value must be unique within that entry/record. 
                Multiple /old_locus_tag qualifiers with distinct values are 
                allowed within a single feature; /old_locus_tag and /locus_tag 
                values must not be identical within a single feature.
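
  The within-feature rules in the Comment above could be checked along the
following lines. This is a minimal sketch, not NCBI's validation code: the
function name and its arguments are hypothetical, and it does not attempt
the record-level rule about identical values shared by the same gene.

```python
def check_locus_tags(locus_tag, old_locus_tags):
    """Return a list of problems for one feature's qualifiers.

    locus_tag      -- the feature's /locus_tag value, or None if absent
    old_locus_tags -- list of the feature's /old_locus_tag values
    """
    problems = []
    # /old_locus_tag is only valid where a /locus_tag is present.
    if old_locus_tags and locus_tag is None:
        problems.append("/old_locus_tag requires a /locus_tag")
    # Interpretation: multiple values on one feature must be distinct.
    if len(set(old_locus_tags)) != len(old_locus_tags):
        problems.append("repeated /old_locus_tag value in one feature")
    # Old and current identifiers must differ within a feature.
    if locus_tag is not None and locus_tag in old_locus_tags:
        problems.append("/old_locus_tag is identical to /locus_tag")
    # Value format is "text" (single token).
    for value in old_locus_tags:
        if len(value.split()) != 1:
            problems.append(f"not a single token: {value!r}")
    return problems


print(check_locus_tags("YPO0002", ["RSc0382"]))
```

An empty list means that none of these checks found a problem.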

1.3.3 New type of gap() operator

  CON-division records utilize a CONTIG line with a join() statement,
which specifies how sequences can be combined to form a much larger
object. For example:

LOCUS       AE016959            23508449 bp    DNA     linear   CON 12-JUN-2003
DEFINITION  Oryza sativa (japonica cultivar-group) chromosome 10, complete
            sequence.
ACCESSION   AE016959
[....]
CONTIG      join(AE017047.1:1..300029,AE017048.1:61..300029,
            AE017049.1:61..304564,AE017050.1:61..302504,AE017051.1:61..300029,
            AE017052.1:61..300029,AE017053.1:61..303511,AE017054.1:61..302085,
            AE017055.1:61..300029,AE017056.1:61..300029,AE017057.1:61..300029,
            AE017058.1:61..300029,AE017059.1:61..300029,AE017060.1:61..95932,
            gap(30001),AE017061.1:1..300028,AE017062.1:61..306096,
[....]

Two forms of the gap operator are already legal in these join statements:

  gap()  : indicates a gap of unknown length

  gap(N) : where 'N' is a positive integer, indicates a gap with a
           physically-estimated length of 'N' bases.

  In some sequencing projects, a convention is agreed upon by which gaps
of unknown length are all represented by a uniform value, such as 100.

  To reflect this convention, a new type of gap operator is legal as of
October 2004 : 'gap(unkN)', where 'N' is a positive integer. For a gap of
length 100, utilized by convention rather than reflective of the gap's
actual size, the operator would be:

      gap(unk100)

  This new gap operator will make clear the distinction between a
gap with a physically-estimated length, and a gap with a length that
has no actual physical basis.
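
  Software that reads CONTIG lines may need to tell the three gap forms
apart. The sketch below shows one way to do so in Python. The function
name and the labels it returns are hypothetical, and the pattern assumes
that 'N' is written without leading zeros.

```python
import re

# Matches gap(), gap(N) and gap(unkN), where N is a positive integer.
GAP_RE = re.compile(r"gap\((?:(unk)?([1-9][0-9]*))?\)")


def classify_gap(token):
    """Return (kind, length) for a single gap operator."""
    match = GAP_RE.fullmatch(token.strip())
    if match is None:
        raise ValueError(f"not a gap operator: {token!r}")
    unk, digits = match.groups()
    if digits is None:
        return ("unknown", None)                       # gap()
    if unk:
        return ("unknown_by_convention", int(digits))  # gap(unkN)
    return ("estimated", int(digits))                  # gap(N)


for token in ("gap()", "gap(30001)", "gap(unk100)"):
    print(token, classify_gap(token))
```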

1.3.3 New /compare qualifier

  Five different features exist which can be used to annotate regions
of sequence that are either uncertain or that differ in comparison
to some other sequence:

	variation
	conflict
	misc_difference
	old_sequence
	unsure

  A /citation qualifier is used to refer to a publication that details
the nature of the uncertain or differing bases. However, a publication
may not always be available (unpublished references), and simply 
referring to a publication is quite indirect.

  The new /compare qualifier provides a method for directly 
referencing a particular sequence that exhibits a sequence
difference:

	/compare="Accession.Version"

For example:

	/compare="M10101.1"

  This new qualifier is legal as of this October 2004 GenBank Release.
The complete description of /compare is as follows:

Qualifier       /compare=
Definition      Reference details of an existing public INSD entry 
                to which a comparison is made
Value format    [accession-number.sequence-version]
Example         /compare=AJ634337.1
Comment         This qualifier may be used on the following features:
                misc_difference, conflict, unsure, old_sequence 
                and variation. The features "old_sequence" and "conflict" must
                have either a /citation or a /compare qualifier. Multiple /compare
                qualifiers with different contents are allowed within a 
                single feature. 
                This qualifier is not intended for large-scale annotation 
                of variations, such as SNPs.
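
  The rules in the Comment above lend themselves to a simple check. The
Python below is a hedged sketch only: the function and constant names are
hypothetical, and the Accession.Version pattern is a loose assumption
(capital letters, then digits, then a positive version number) rather
than the full accession rules.

```python
import re

# Features on which /compare may be used, per the description above.
COMPARE_FEATURES = {"misc_difference", "conflict", "unsure",
                    "old_sequence", "variation"}
# Features that must have either a /citation or a /compare qualifier.
NEEDS_EVIDENCE = {"old_sequence", "conflict"}

# Assumed shape of Accession.Version; real accession rules are stricter.
ACC_VER_RE = re.compile(r"[A-Z]+[0-9]+\.[1-9][0-9]*")


def check_compare(feature_key, compare_values, has_citation=False):
    """Return a list of problems for one feature's /compare usage."""
    problems = []
    if compare_values and feature_key not in COMPARE_FEATURES:
        problems.append(f"/compare is not legal on {feature_key}")
    for value in compare_values:
        if ACC_VER_RE.fullmatch(value) is None:
            problems.append(f"not Accession.Version: {value!r}")
    # Multiple /compare qualifiers are allowed if their contents differ.
    if len(set(compare_values)) != len(compare_values):
        problems.append("repeated /compare value in one feature")
    if feature_key in NEEDS_EVIDENCE and not (compare_values or has_citation):
        problems.append(f"{feature_key} needs /citation or /compare")
    return problems


print(check_compare("conflict", ["AJ634337.1"]))
```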

1.3.4 GSS File Header Problem

  GSS sequences at GenBank are maintained in one of two different systems,
depending on their origin. One recent change to release processing involves
the parallelization of the dumps from those systems. Because the second dump
(for example) has no prior knowledge of exactly how many GSS files will be
dumped from the first, it doesn't know how to number its own output files.

  There is thus a discrepancy between the filenames and file headers for
eighteen GSS flatfiles in Release 144.0. Consider the gbgss100.seq file:

GBGSS1.SEQ           Genetic Sequence Data Bank
                          October 15 2004

                NCBI-GenBank Flat File Release 144.0

                           GSS Sequences (Part 1)

   88260 loci,    65614942 bases, from    88260 reported sequences

  Here, the header gives the filename as "GBGSS1.SEQ" and the part number as
"1", though the file has been renamed as "100" based on the files dumped from
the other system.

  We will work to resolve this discrepancy in future releases, but the
priority is certainly much lower than many other tasks.
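
  Users whose scripts rely on the header can detect the affected files by
comparing the first header field with the actual filename. The following
Python is a minimal sketch under that assumption; the function names are
hypothetical, and only the first line of the header is examined.

```python
def header_filename(header_line):
    """Return the filename field from the first line of a flatfile header."""
    fields = header_line.split()
    return fields[0].lower() if fields else None


def header_matches(filename, header_line):
    """True if the header names the same file as the one on disk."""
    return header_filename(header_line) == filename.lower()


first_line = "GBGSS1.SEQ           Genetic Sequence Data Bank"
print(header_matches("gbgss100.seq", first_line))
```

For the gbgss100.seq example above the comparison fails, because the
header still names GBGSS1.SEQ.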

1.4 Upcoming Changes

1.4.1 New gap feature

  A new feature key for sequence gaps will become legal as of the
December 2004 GenBank release:

Feature key           gap

Definition            gap in the sequence
Mandatory qualifiers  /estimated_length=unknown or <integer>
Optional qualifiers   /map="text"
                      /note="text"
Comment               the location span of the gap feature for an unknown 
                      gap is 100 bp, with the 100 bp indicated as 100 "n"s in 
                      the sequence.  Where estimated length is indicated by 
                      an integer, this is indicated by the same number of 
                      "n"s in the sequence. 
                      No upper or lower limit is set on the size of the gap.
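
  As a rough illustration of the Comment above, the number of "n"s
expected for a gap feature follows directly from /estimated_length. The
helper below is hypothetical, and its rejection of values below 1 is an
assumption rather than part of the feature definition.

```python
def gap_span_length(estimated_length):
    """Return the expected number of "n"s for a gap feature."""
    if estimated_length == "unknown":
        return 100  # an unknown gap is represented by a 100 bp span
    length = int(estimated_length)
    if length < 1:
        # Assumption: a gap has at least one base.
        raise ValueError("estimated length must be a positive integer")
    return length


print(gap_span_length("unknown"), gap_span_length("2500"))
```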

1.4.2 Continuous ranges of secondary accessions

  With the removal of sequence length limits, some genomes (typically
bacterial) that had been split into many pieces are gradually being
replaced by a single sequence record. U00096 is a good example.

  When this happens, the accessions of the former small pieces become
secondary accessions for the single large sequence record. When each
secondary is separately listed, the ACCESSION line becomes excessively
lengthy.

  As of GenBank Release 146.0 in February 2005, it will be legal to
represent continuous ranges of secondary accessions by a start accession,
a dash character, and an end accession. In the case of U00096, the
ACCESSION line would thus look like:

	ACCESSION   U00096 AE000111-AE000510

  Further details about the conventions for secondary accession ranges
will be provided via these release notes and the GenBank newsgroup.  
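
  Parsers that expect one accession per token will need to expand such
ranges. The Python below sketches one possible approach. Because the
conventions have not yet been finalized, it assumes that both ends of a
range share the same letter prefix and the same number of digits; the
function names are hypothetical.

```python
import re

ACC_RE = re.compile(r"([A-Z]+)([0-9]+)")


def expand_accession_range(token):
    """Expand 'AE000111-AE000510' into a list of single accessions."""
    if "-" not in token:
        return [token]
    start, end = token.split("-", 1)
    first, last = ACC_RE.fullmatch(start), ACC_RE.fullmatch(end)
    if first is None or last is None:
        raise ValueError(f"cannot parse range: {token!r}")
    prefix, width = first.group(1), len(first.group(2))
    # Assumption: same prefix and same digit count at both ends.
    if last.group(1) != prefix or len(last.group(2)) != width:
        raise ValueError(f"mismatched range ends: {token!r}")
    low, high = int(first.group(2)), int(last.group(2))
    if low > high:
        raise ValueError(f"descending range: {token!r}")
    return [f"{prefix}{n:0{width}d}" for n in range(low, high + 1)]


def parse_accession_line(line):
    """Return all accessions (primary first) from an ACCESSION line."""
    accessions = []
    for token in line.split()[1:]:  # skip the ACCESSION keyword
        accessions.extend(expand_accession_range(token))
    return accessions


accs = parse_accession_line("ACCESSION   U00096 AE000111-AE000510")
print(accs[0], accs[1], accs[-1], len(accs))
```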
