GenBank Release 136.0 Now Available

Mark Cavanaugh cavanaug at ncbi.nlm.nih.gov
Tue Jun 17 20:10:06 EST 2003


Greetings GenBank Users,

  GenBank Release 136.0 is now available via ftp from the National Center
for Biotechnology Information (NCBI):

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 136.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 136.0

  Uncompressed, the Release 136.0 flatfiles require approximately 107 GB
(sequence files only) or 121 GB (including the 'short directory' and
'index' files).  The ASN.1 version requires approximately 95 GB. From the
release notes:

   Release  Date       Base Pairs   Entries

   135      Apr 2003   31099264455  24027936
   136      Jun 2003   32528249295  25592865

  Close-of-data was 06/12/2003. Four working days were required to prepare
this release. In the eight-week period between the close dates for GenBank
releases 135.0 and 136.0, GenBank grew by 1,428,984,840 basepairs and by
1,564,929 sequence records. During that same period, 98,374 records were
updated. Combined, this yields an average of about 26,000 new/updated
records per day.
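
  The growth figures above follow directly from the release table. The short
Python sketch below reproduces the arithmetic; the variable names are
illustrative only and are not part of any NCBI tool.

```python
# Totals copied from the release table in this announcement.
releases = {
    135: {"base_pairs": 31_099_264_455, "entries": 24_027_936},
    136: {"base_pairs": 32_528_249_295, "entries": 25_592_865},
}
updated_records = 98_374  # records updated between the two close dates

new_base_pairs = releases[136]["base_pairs"] - releases[135]["base_pairs"]
new_records = releases[136]["entries"] - releases[135]["entries"]
new_or_updated = new_records + updated_records

print(new_base_pairs)   # 1428984840
print(new_records)      # 1564929
print(new_or_updated)   # 1663303
```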

  We would like to remind our users that GenBank mirrors are available
at ftp://genbank.sdsc.edu/pub and ftp://bio-mirror.net/biomirror/genbank .
Those who experience slow FTP transfers of large files (entire releases, the
GenBank Cumulative Update, etc) might realize an improvement in transfer
rates from these alternate sites when traffic at the NCBI is heavy.

  For additional release information, see the README files in either of the
directories mentioned above, and the release notes (gbrel.txt) in the
genbank directory. Sections 1.3 and 1.4 of the release notes (Changes in
Release 136.0 and Upcoming Changes) have been appended below.

  *NOTE* Section 1.4.1 discusses a very important change: the removal
of sequence length limits for all classes of GenBank sequence records,
as of June 2004. We strongly encourage all users to review this information.

  Release 136.0 data, and subsequent updates, are available now via NCBI's
Entrez and Blast services.

  If you encounter problems while ftp'ing or uncompressing Release 136.0,
please send email outlining your difficulties to info at ncbi.nlm.nih.gov .

Mark Cavanaugh, Vladimir Alekseyev, Anton Butanaev, Michael Kimelman
GenBank
NCBI/NLM/NIH

1.3 Important Changes in Release 136.0

1.3.1 Organizational changes

  The total number of sequence data files increased by 26 with this release:

  - the EST division is now comprised of 259 files (+15)
  - the GSS division is now comprised of 77 files  (+7)
  - the PLN division is now comprised of 8 files   (+1)
  - the ROD division is now comprised of 8 files   (+1)
  - the STS division is now comprised of 3 files   (+1)
  - the VRT division is now comprised of 3 files   (+1)

1.3.2 Erratum : Release 135.0 release notes (April 2003)

  The description of the /locus_tag qualifier in the GenBank 135.0
release notes erroneously states that it is intended for use on
the source feature. In fact, /locus_tag is used for gene, coding
region, and other features, and is not utilized by the source
feature. See the Feature Table documentation for complete details
regarding qualifier usage:

	http://www.ncbi.nlm.nih.gov/projects/collab/FT/index.html

  In addition, we noticed that the release notes for GenBank 133.0
(December, 2002) and GenBank 134.0 (February 2003) referred to the
upcoming April 2003 release as number 134.0 rather than 135.0 . 

  Our apologies for any confusion that these errors may have caused,
especially regarding the timeline for introduction of /locus_tag,
/mol_type, and /segment .

1.3.3 GSS File Header Problem

  GSS sequences at GenBank are maintained in one of two different systems,
depending on their origin. One recent change to release processing involves
the parallelization of the dumps from those systems. Because the second dump
(for example) has no prior knowledge of exactly how many GSS files will be
dumped from the first, it doesn't know how to number its own output files.

  There is thus a discrepancy between the filenames and file headers of eleven
GSS flatfiles in Release 136.0. Consider the gbgss67.seq file:

GBGSS1.SEQ           Genetic Sequence Data Bank
                            June 15 2003

                NCBI-GenBank Flat File Release 136.0

                           GSS Sequences (Part 1)

   86693 loci,    65544901 bases, from    86693 reported sequences

  Here, the header reports the filename as GBGSS1.SEQ and the part number as 1,
though the file has been renamed to gbgss67.seq based on the files dumped from
the other system.

  We will work to resolve this discrepancy in future releases, but the
priority is certainly much lower than many other tasks.
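
  Software that relies on the header can detect the affected files by
comparing the part number in the filename with the one in the first header
line. The following Python sketch assumes the header begins with a name of
the form GBGSSn.SEQ, as in the example above; the function names are
hypothetical.

```python
import re

def header_part_number(header_line):
    """Return the part number from a header line such as 'GBGSS1.SEQ ...'."""
    match = re.match(r"GBGSS(\d+)\.SEQ\b", header_line.strip(), re.IGNORECASE)
    return int(match.group(1)) if match else None

def filename_part_number(filename):
    """Return the part number from a filename such as 'gbgss67.seq'."""
    match = re.fullmatch(r"gbgss(\d+)\.seq", filename, re.IGNORECASE)
    return int(match.group(1)) if match else None

def header_matches_filename(filename, header_line):
    """True only when both part numbers are found and they agree."""
    expected = filename_part_number(filename)
    found = header_part_number(header_line)
    return expected is not None and expected == found

header = "GBGSS1.SEQ           Genetic Sequence Data Bank"
print(header_matches_filename("gbgss67.seq", header))  # False
print(header_matches_filename("gbgss1.seq", header))   # True
```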

1.4 Upcoming Changes

1.4.1 Sequence Length Limitation To Be Removed In June 2004

  At the May 2003 collaborative meeting among representatives of GenBank,
EMBL, and DDBJ, it was decided that the 350 kilobase limit on the sequence
length of database records will be removed as of June 2004.

  Individual, complete sequences are currently expected to be a maximum
of 350 kbp in length. One major reason for the existence of this limit is
as an aid to users of sequence analysis software, some of which might not
be capable of processing megabase-scale sequences.

  However, very significant exceptions to the 350 kbp limit have existed
for several years: Phase 1 (unordered, unoriented) and Phase 2 (ordered,
oriented) high-throughput genomic sequences (HTGS) generated by efforts
such as the Human Genome Project; large dispersed eukaryotic genes with
an intron/exon structure that spans more than 350 kbp; and sequences
which result from assemblies of Whole Genome Shotgun (WGS) project data.

  Given these exceptions, and the technological advances which have made
large-scale sequencing practical for an increasing number of researchers,
the collaboration has decided that the 350 kbp limit must be removed.

  As of June 2004, the length of database sequences will be limited only
by the natural structures of an organism's genome. For example, a single
record might be used to represent all of human chromosome 1, which is
approximately 245 Mbp in length.

  Software developers for some of the larger commercial sequence analysis
packages were recently asked what timeframe would be appropriate for this
change. Answers ranged from "immediately", to "several months", to "one year".
So the one-year timeframe was selected, to provide ample time to implement
changes which megabase-scale sequences may require.

  Some sample records with very large sequences have been made available
so that developers can begin to test their software modifications:

	ftp://ftp.ncbi.nih.gov/genbank/LargeSeqs
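
  Developers who want to find out which records would exceed the old limit
can read the sequence length from the LOCUS line. The Python sketch below
assumes the length is the integer immediately before the "bp" token, as in
the sample LOCUS line shown in section 1.4.2; the function names are
hypothetical.

```python
OLD_LIMIT_BP = 350_000  # the 350 kbp limit described in this section

def locus_length(locus_line):
    """Return the sequence length from a LOCUS line, or None if not found."""
    tokens = locus_line.split()
    if not tokens or tokens[0] != "LOCUS":
        return None
    for i, token in enumerate(tokens):
        # Assumption: the length is the integer just before the 'bp' token.
        if token == "bp" and i > 0 and tokens[i - 1].isdigit():
            return int(tokens[i - 1])
    return None

def exceeds_old_limit(locus_line):
    """True when the record is longer than the old 350 kbp limit."""
    length = locus_length(locus_line)
    return length is not None and length > OLD_LIMIT_BP

line = "LOCUS       AY244763                5686 bp    DNA     linear   BCT 10-APR-2003"
print(locus_length(line))       # 5686
print(exceeds_old_limit(line))  # False
```

  Python integers have no fixed width, so a 245 Mbp length needs no special
handling here; code in other languages should check that its length fields
are wide enough.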

  Many changes are expected after the removal of the length limit. For 
example, complete bacterial genomes (typically on the order of several
megabases) will be re-assembled into single sequence records. The submission
process for such genomes will become much more streamlined, since database
staff will no longer have to split the genomes into pieces. BLAST services
will be enhanced, so that hits reported within very large sequences will
be presented in a meaningful context.

  All such changes will be discussed more fully in future release notes,
the NCBI newsletter, and the GenBank newsgroup.

1.4.2 BASECOUNT line to be dropped

  The BASECOUNT line of the GenBank flatfile format provides totals for
the number of A, T, G, C, and 'other' basepairs that are present within
the sequence of a database record. For example:

LOCUS       AY244763                5686 bp    DNA     linear   BCT 10-APR-2003
DEFINITION  Rhodococcus sp. DS7 cysDNCQ operon, complete sequence.
ACCESSION   AY244763
VERSION     AY244763.1  GI:29725657
....
BASE COUNT     1137 a   1661 c   1821 g   1066 t      1 others
ORIGIN      
        1 cgcggtttgt gacgtctgat tgccggtcat tgacctttgg gtagaacgag ttctattctg
       61 tgattgcgtt caatttagaa ccagtccggt acataaatgt accgatgcgg aaatggtgtt
....
     5281 tgtcagctcg gtgtctggng gcgaggctaa gcaccaacgg cttcggtagc agaaccacat

  This information is computationally expensive to produce for very large
sequences, and for sequences in the CON division. In the CON division case,
a record might be comprised of 'pointers' to hundreds, or even thousands,
of underlying GenBank records. So to calculate the BASECOUNT line content,
retrievals of sequence data for those many records must be performed. This
can noticeably impact the response time for flatfile generation within the
Entrez application.

Hence, as of the October 2003 GenBank Release (138.0), the BASECOUNT linetype
will no longer be present in GenBank Release and GenBank Update products.

Depending on demand, a display option might be implemented in Entrez which
allows users to choose to have BASECOUNT shown.
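
  Users who depend on these totals can recompute them from the sequence
lines that follow ORIGIN. The Python sketch below is one possible approach,
not the method GenBank uses; it treats every letter other than a, c, g and t
as 'others'. The function name is hypothetical.

```python
from collections import Counter

def base_counts(origin_lines):
    """Count a, c, g, t and 'others' in the sequence lines that follow ORIGIN."""
    counts = Counter()
    for line in origin_lines:
        for ch in line.lower():
            if ch.isalpha():  # skip position numbers and spaces
                counts[ch if ch in "acgt" else "others"] += 1
    return {key: counts[key] for key in ("a", "c", "g", "t", "others")}

# One sequence line copied from the example record above.
sample = ["     5281 tgtcagctcg gtgtctggng gcgaggctaa gcaccaacgg cttcggtagc agaaccacat"]
print(base_counts(sample))
```

  For very large records, the lines can be fed to the function one at a time
from an open file, so the whole sequence never has to be held in memory.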

1.4.3 New oriT feature

  As of Release 138.0 in October 2003, a new feature key (oriT) will be
legal for the feature table. Preliminary documentation for this new
feature is available:

	Feature Key            oriT

	Definition             origin of transfer; region of a plasmid where
                               transfer is initiated during the process of
                               conjugation or mobilisation.

	Mandatory qualifiers:  None

	Optional Qualifiers:   /bound_moiety="text"
                               /citation=[number]
                               /db_xref="<database>:<identifier>"
                               /direction=value
                               /evidence=<evidence_value>
                               /gene="text"
                               /label=feature_label
                               /locus_tag="text" (single token)
                               /map="text"
                               /note="text"
                               /rpt_family="text"
                               /rpt_type=<repeat_type>
                               /rpt_unit=<feature_label>
                               /standard_name="text"
                               /usedin=accnum:feature_label

	Molecule Scope:        DNA

	Comments:              rep_origin should be used for origins
                               of replication; /direction has 
                               legal values RIGHT, LEFT and BOTH,
                               however only RIGHT and LEFT are valid
                               when used in conjunction with the oriT
                               feature 

1.4.4 New /ecotype qualifier

  As of the October 2003 GenBank Release (138.0), a new source feature
qualifier called /ecotype will begin to be used. The preliminary
definition for /ecotype is :

        Qualifier       /ecotype=    
        Definition      A distinct population of organisms of a
                        widespread species that has adapted
                        genetically to its own local habitat.
                        Nevertheless, they can still reproduce
                        with members of other ecotypes of the
                        same species.
        Value format    "text"
        Example         /ecotype="Columbia"

        Comment         'Ecotype' is often applied to standard
                        genetic stocks of Arabidopsis thaliana,
                        but it can be applied to any organism,
                        especially sessile organisms like plants.


1.4.5 Change to value format of /rpt_unit

  As of Release 138.0 in October 2003, the value-format of the /rpt_unit
qualifier will be changed to allow 'text' . The currently documented
format is:

	Value format    <feature_label>  or  <base_range>
	Example         /rpt_unit=Alu_rpt1
	                /rpt_unit=202..245
	Comment         used to indicate feature which defines (or base range of)
	                the repeat unit of which a repeat region is made

  However, a very common value for /rpt_unit is a literal sequence
string that represents the repeating unit(s). For example:

	/rpt_unit=ta
	/rpt_unit=ac;ag

So the format of this qualifier will be changed to:

	Value format    "text"  or  <feature_label>  or  <base_range>

Existing feature label and base range values will eventually be presented
as text values.
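
  Parsers may need to tell the three kinds of /rpt_unit value apart. The
Python sketch below is a heuristic only: it assumes that a literal repeat
unit consists of nucleotide letters, with ';' separating alternatives as in
the examples above. A feature label made up solely of such letters would be
misclassified. The function name and category names are hypothetical.

```python
import re

BASE_RANGE = re.compile(r"^\d+\.\.\d+$")
# Assumption: literal repeat units use only these nucleotide letters,
# with ';' between alternatives (as in 'ac;ag').
SEQUENCE_TEXT = re.compile(r"^[acgtunACGTUN]+(;[acgtunACGTUN]+)*$")

def classify_rpt_unit(value):
    """Classify a /rpt_unit value as a base range, sequence text, or label/text."""
    value = value.strip().strip('"')
    if BASE_RANGE.match(value):
        return "base_range"
    if SEQUENCE_TEXT.match(value):
        return "sequence_text"
    return "feature_label_or_text"

for v in ("Alu_rpt1", "202..245", "ta", "ac;ag"):
    print(v, classify_rpt_unit(v))
```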

1.4.6 New operon feature

  Starting with the October 2003 release (138.0), a new feature will be
legal for the feature table:

	Feature Key:           operon

	Definition:            region containing polycistronic transcripts
                               and regulatory sequences containing genes
                               that encode enzymes that are in the same
                               metabolic pathway  

	Optional qualifiers:   /allele="text"
                               /citation=[number]
                               /db_xref="<database>:<identifier>"
                               /evidence=<evidence_value>
                               /function="text"
                               /operon="text"
                               /label=feature_label
                               /locus_tag="text" (single token)
                               /map="text"
                               /note="text"
                               /product="text"
                               /pseudo
                               /phenotype="text"
                               /standard_name="text"
                               /usedin=accnum:feature_label

  In bacteria, many genes encoding for specific biosynthetic pathways are
transcribed in polycistronic operons. It has been challenging to reflect
this biology within the GenBank flatfile format. GenBank has been using
a gene feature that spans the entire regulatory region and all of the coding
regions, together with separate gene features for the individual genes, each
spanning its own coding region.

  The new operon feature will simplify the annotation of cases like
these. Examples of the new operon feature will be provided in future
release notes.

1.4.7 [er] prefix for JOURNAL line

  As of Release 138.0 in October 2003, a new prefix will be legal for the
JOURNAL line:  [er] .

  This prefix is an abbreviation for Electronic Resource, which is a 
term that describes journal articles that are available on-line.

  In 1999, an interim 'Online Publication' REFERENCE format was adopted for
use at GenBank in order to cite articles appearing only electronically:

      REFERENCE   1  (bases 1 to 2858)
        AUTHORS   Smith, J.
        TITLE     Cloning and expression of a phospholipase gene
        JOURNAL   Online Publication
        REMARK    Online-Journal-name; Article Identifier; URL

  In subsequent years, no standards for citing on-line journal articles
have emerged from library organizations.

  One such library organization (National Library of Medicine, NIH) is now
assigning identifiers (Medline UIs and PubMed Ids) to articles published
on-line, and it is presenting these articles in a manner that is identical
to print-journal articles. For example:

      REFERENCE   1
        AUTHORS   Haas,B.J., Volfovsky,N., Town,C.D., Troukhan,M.,
                  Alexandrov,N., Feldmann,K.A., Flavell,R.B., White,O. and
                  Salzberg,S.L.
        TITLE     Full-length messenger RNA sequences greatly improve genome
                  annotation
        JOURNAL   Genome Biol. 3 (6), RESEARCH0029 (2002)
        MEDLINE   22088475
         PUBMED   12093376

  Although these citations may contain journal abbreviations, volume numbers,
issue/part/supplement numbers, pages, and year (just like a print-journal
citation), there is no guarantee that the contents of these fields will be 
comparable to those of print-journal citations.

  In the case above, although the page number is a bit unusual
("RESEARCH0029"), software processing the JOURNAL line would probably still
be able to parse its contents. But there is also a possibility that these 
fields could contain unusual characters (embedded spaces, commas, parentheses),
and possibly even URLs. So the addition of [er] :

        JOURNAL   [er] Genome Biol. 3 (6), RESEARCH0029 (2002)

will act as a warning (primarily to software) that the contents of the
JOURNAL line might not be as parsable as a print-journal JOURNAL line.
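
  Software can check for the prefix before attempting to parse the rest of
the line. The Python sketch below assumes that [er] appears immediately
after the JOURNAL keyword, as in the example above; the function name is
hypothetical.

```python
ER_PREFIX = "[er]"

def split_journal_line(line):
    """Return (is_electronic, citation_text) for a flatfile JOURNAL line."""
    text = line.strip()
    if text.startswith("JOURNAL"):
        text = text[len("JOURNAL"):].strip()
    if text.startswith(ER_PREFIX):
        return True, text[len(ER_PREFIX):].strip()
    return False, text

print(split_journal_line("  JOURNAL   [er] Genome Biol. 3 (6), RESEARCH0029 (2002)"))
# (True, 'Genome Biol. 3 (6), RESEARCH0029 (2002)')
print(split_journal_line("  JOURNAL   Genome Biol. 3 (6), RESEARCH0029 (2002)"))
# (False, 'Genome Biol. 3 (6), RESEARCH0029 (2002)')
```

  When the first element is True, a caller might skip strict parsing of the
volume, page, and year fields and keep the citation as free text.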

1.4.8 Accession format of WGS records

  Whole Genome Shotgun (WGS) sequences utilize an accession number format which
is different from those used for non-WGS GenBank sequences. This format is
referred to as 4 + 2 + 6, and is comprised of:

  - a 4-letter WGS project code
  - a 2-digit assembly-version number
  - a 6 (and sometimes 7) digit sequence number

  Because of their unique nature, WGS sequences are kept separate from other
GenBank products:

	ftp://ftp.ncbi.nih.gov/genbank/wgs

  For example, sequences much larger than the current 350 kbp limit can be
generated during the WGS assembly phase. In addition, there is no tracking
of nucleotide sequences from one assembly to the next. So the accessions
of one:

	AAAB01000001
	AAAB01000002
	AAAB01000003
	....

are not necessarily related in any way to those of the next assembly:

	AAAB02000001
	AAAB02000002
	AAAB02000003
	....

  When a WGS project is completed, it is possible that the submitters may choose
to submit a single finished WGS sequence with a 4 + 2 + 6 accession, at which
point it would appear in the non-WGS portion of GenBank.

  Alternatively, the submitters might choose to submit the completed genome via
a non-WGS method, in which case a de-novo non-WGS accession would be assigned.
That record would then have one or more 4 + 2 + 6 WGS accessions as
secondary accessions.

  Both scenarios are likely to occur, especially after the 350 kbp sequence
length restriction is lifted. So we felt it was important to alert users
that WGS accessions will eventually be encountered in the non-WGS portion
of GenBank, as primary or secondary accession numbers.
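
  Software that handles the non-WGS portion of GenBank may therefore need to
recognize the 4 + 2 + 6 format. The Python sketch below assumes uppercase
project codes and allows the 6 or 7 digit sequence number described above;
the function name and dictionary keys are hypothetical.

```python
import re

# 4-letter project code, 2-digit assembly version, 6 or 7 digit sequence number.
WGS_ACCESSION = re.compile(r"^([A-Z]{4})(\d{2})(\d{6,7})$")

def parse_wgs_accession(accession):
    """Split a WGS accession into its parts, or return None if it does not match."""
    match = WGS_ACCESSION.match(accession.strip())
    if match is None:
        return None
    project, version, number = match.groups()
    return {
        "project": project,
        "assembly_version": int(version),
        "sequence_number": int(number),
    }

print(parse_wgs_accession("AAAB01000001"))
# {'project': 'AAAB', 'assembly_version': 1, 'sequence_number': 1}
print(parse_wgs_accession("AY244763"))
# None
```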

