GenBank Release 115.0 Available

Mark Cavanaugh cavanaug at lagrange.nlm.nih.gov
Wed Dec 22 06:31:42 EST 1999


  GenBank Release 115.0 is now available via ftp from the National Center
for Biotechnology Information:

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ncbi.nlm.nih.gov   genbank     GenBank Release 115.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 115.0

  Uncompressed, the Release 115.0 flatfiles require roughly 18917 MB
(sequence files only) or 22026 MB (including the 'index' files). The
ASN.1 version requires roughly 16693 MB. From the release notes:

   Release    Date     Base Pairs   Entries

   114        Oct 99   3841163011   4864570
   115        Dec 99   4653932745   5354511

  The 812 Mbp growth since Release 114.0 is the largest single-release
increase ever experienced by GenBank (the previous record was 441 Mbp).
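
  The growth figure can be checked directly against the table above. The
following is a minimal sketch; the dictionary layout and the function name
`growth` are illustrative choices, and only the four numbers come from the
release notes.

```python
# Figures copied from the release-notes table above.
RELEASES = {
    114: {"base_pairs": 3841163011, "entries": 4864570},
    115: {"base_pairs": 4653932745, "entries": 5354511},
}


def growth(old, new, field):
    """Return the increase in `field` between two releases."""
    return RELEASES[new][field] - RELEASES[old][field]


bp_growth = growth(114, 115, "base_pairs")
entry_growth = growth(114, 115, "entries")

# Whole megabases of growth (truncated, not rounded).
print(bp_growth // 10**6, "Mbp;", entry_growth, "new entries")
```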

  Close-of-data was 12/10/99. Twelve days were required to prepare this
release. For additional information, see the README files in either of the
directories mentioned above, and the release notes (gbrel.txt) in the
genbank directory. Sections 1.3 and 1.4 of the release notes (Changes in
Release 115.0 and Upcoming Changes) have been appended below.

  Sections 1.3.2 and 1.3.3 describe particularly important changes related
to NID/PID and the new gbcon.seq division, implemented with this release.

  Release 115.0 data are currently available via NCBI's Entrez and Blast
servers, and the 'query' email server.

  New GenBank cumulative update files (gbcu.flat.Z and gbcu.aso.Z), containing
only those entries new/updated since the Release 115.0 close-of-data, should be
available by 9:00am EST, December 22. Please note that the new CUs will be
smaller than previous versions you might have obtained after Release 114.0 was
posted.

  If you encounter problems while ftp'ing or uncompressing Release 115.0,
please send email outlining your difficulties to info at ncbi.nlm.nih.gov .

Mark Cavanaugh & Vladimir Aleksey
GenBank
NCBI/NLM/NIH

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

1.3 Important Changes in Release 115.0

1.3.1 Organizational changes

  Due to database growth, the PLN division is now being split into three pieces.

  Due to database growth, the EST division is now being split into forty-four
pieces.

  Due to database growth, the HTG division is now being split into seven pieces.

  Due to database growth, the GSS division is now being split into fifteen pieces.

1.3.2 Removal of obsolete NID linetype and PID /db_xref 

  When the Accession.Version system was introduced in the Spring of 1999, we
stated via the GenBank newsgroup that the NID linetype and the PID /db_xref
qualifier would eventually be phased out:

  "The NID linetype and the PID /db_xref qualifier will eventually be
   removed from the GenBank flatfile format, probably by August 1999.
   However, the NCBI GI identifiers that they contain will remain
   available, via the new VERSION linetype and the new GI /db_xref
   qualifier."

  This removal of NID/PID has been implemented with GenBank Release
115.0 (December 1999). Excerpts from a GenBank record that illustrate
the format change are appended below.

  NCBI "GI" sequence identifiers can still be obtained from GenBank flatfiles
via the VERSION linetype and the GI /db_xref qualifier of CDS features.

Previous GenBank flatfile view of the HUMCCLEC1 segmented set:

LOCUS       HUMCCLEC1   17079 bp    DNA             PRI       04-FEB-1999
DEFINITION  Homo sapiens cartilage-derived C-type lectin (CLECSF1) gene, exons
            1 and 2.
ACCESSION   AF077344 AF077343
NID         g3982778
VERSION     AF077344.1  GI:3982778
KEYWORDS    .
SEGMENT     1 of 2
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
            Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 17079)
  AUTHORS   Neame,P.J. and Tapp,H.
  TITLE     The Cartilage-Derived, C-type Lectin (CLECSF1); Structure of the
            Human Gene and Chromosomal Location
  JOURNAL   Unpublished
....
FEATURES             Location/Qualifiers
     source          1..17079
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="16"
                     /map="16q23"
                     /tissue_type="cartilage"
     repeat_region   complement(180..393)
                     /rpt_family="MIR"
....
//
LOCUS       HUMCCLEC2    3362 bp    DNA             PRI       04-FEB-1999
DEFINITION  Homo sapiens cartilage-derived C-type lectin (CLECSF1) gene, exon 3
            and complete cds.
ACCESSION   AF077345
NID         g3386489
VERSION     AF077345.1  GI:3386489
KEYWORDS    .
SEGMENT     2 of 2
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
            Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 3362)
  AUTHORS   Neame,P.J. and Tapp,H.
  TITLE     The Cartilage-Derived, C-type Lectin (CLECSF1); Structure of the
            Human Gene and Chromosomal Location
  JOURNAL   Unpublished
....
FEATURES             Location/Qualifiers
     source          1..3362
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="16"
                     /map="16q23"
                     /tissue_type="cartilage"
     mRNA            join(AF077344.1:10886..11079,AF077344.1:16446..16529,
                     777..1171)
                     /gene="CLECSF1"
                     /product="cartilage-derived C-type lectin"
     gene            order(AF077344.1:10886..17079,1..3362)
                     /gene="CLECSF1"
     CDS             join(AF077344.1:10965..11079,AF077344.1:16446..16529,
                     777..1171)
                     /gene="CLECSF1"
                     /note="similar to tetranectin"
                     /codon_start=1
                     /product="cartilage-derived C-type lectin"
                     /protein_id="AAD12542.1"
                     /db_xref="PID:g3386491"
                     /db_xref="GI:3386491"
                     /translation="MAKNGLVICILVITLLLDQTTSHTSRLKARKHSKRRVRDKDGDL
                     KTQIEKLWTEVNALKEIQALQTVCLRGTKVHKKCYLASEGLKHFHEANEDCISKGGIL
                     VIPRNSDEINALQDYGKRSLPGVNDFWLGINDMVTEGKFVDVNGIAISFLNWDRAQPN
                     GGKRENCVLFSQSAQGKWSDEACRSSKRYICEFTIPQ"
....
//

GenBank flatfile view of the HUMCCLEC1 segmented set as of December 1999.
Note the absence of the NID linetype and the PID /db_xref qualifier:

LOCUS       HUMCCLEC1   17079 bp    DNA             PRI       04-FEB-1999
DEFINITION  Homo sapiens cartilage-derived C-type lectin (CLECSF1) gene, exons
            1 and 2.
ACCESSION   AF077344 AF077343
VERSION     AF077344.1  GI:3982778
KEYWORDS    .
SEGMENT     1 of 2
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
            Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 17079)
  AUTHORS   Neame,P.J. and Tapp,H.
  TITLE     The Cartilage-Derived, C-type Lectin (CLECSF1); Structure of the
            Human Gene and Chromosomal Location
  JOURNAL   Unpublished
....
FEATURES             Location/Qualifiers
     source          1..17079
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="16"
                     /map="16q23"
                     /tissue_type="cartilage"
     repeat_region   complement(180..393)
                     /rpt_family="MIR"
....
//
LOCUS       HUMCCLEC2    3362 bp    DNA             PRI       04-FEB-1999
DEFINITION  Homo sapiens cartilage-derived C-type lectin (CLECSF1) gene, exon 3
            and complete cds.
ACCESSION   AF077345
VERSION     AF077345.1  GI:3386489
KEYWORDS    .
SEGMENT     2 of 2
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
            Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 3362)
  AUTHORS   Neame,P.J. and Tapp,H.
  TITLE     The Cartilage-Derived, C-type Lectin (CLECSF1); Structure of the
            Human Gene and Chromosomal Location
  JOURNAL   Unpublished
....
FEATURES             Location/Qualifiers
     source          1..3362
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="16"
                     /map="16q23"
                     /tissue_type="cartilage"
     mRNA            join(AF077344.1:10886..11079,AF077344.1:16446..16529,
                     777..1171)
                     /gene="CLECSF1"
                     /product="cartilage-derived C-type lectin"
     gene            order(AF077344.1:10886..17079,1..3362)
                     /gene="CLECSF1"
     CDS             join(AF077344.1:10965..11079,AF077344.1:16446..16529,
                     777..1171)
                     /gene="CLECSF1"
                     /note="similar to tetranectin"
                     /codon_start=1
                     /product="cartilage-derived C-type lectin"
                     /protein_id="AAD12542.1"
                     /db_xref="GI:3386491"
                     /translation="MAKNGLVICILVITLLLDQTTSHTSRLKARKHSKRRVRDKDGDL
                     KTQIEKLWTEVNALKEIQALQTVCLRGTKVHKKCYLASEGLKHFHEANEDCISKGGIL
                     VIPRNSDEINALQDYGKRSLPGVNDFWLGINDMVTEGKFVDVNGIAISFLNWDRAQPN
                     GGKRENCVLFSQSAQGKWSDEACRSSKRYICEFTIPQ"
....
//
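
  For software that previously read GIs from the NID line or the PID
/db_xref, the same identifiers can be taken from the VERSION line and the
GI /db_xref qualifier. The sketch below is one possible approach, not an
NCBI-provided parser: the function name `extract_gis` and both regular
expressions are assumptions, and it collects every GI /db_xref in the record
without checking which feature the qualifier belongs to.

```python
import re

# VERSION     AF077345.1  GI:3386489
VERSION_RE = re.compile(r"^VERSION\s+(\S+)\s+GI:(\d+)")
# /db_xref="GI:3386491"
DBXREF_GI_RE = re.compile(r'/db_xref="GI:(\d+)"')


def extract_gis(flatfile_text):
    """Return ((accession.version, nucleotide GI), [qualifier GIs]) for one record."""
    nucleotide = None
    qualifier_gis = []
    for line in flatfile_text.splitlines():
        version_match = VERSION_RE.match(line)
        if version_match:
            nucleotide = (version_match.group(1), int(version_match.group(2)))
            continue
        xref_match = DBXREF_GI_RE.search(line)
        if xref_match:
            qualifier_gis.append(int(xref_match.group(1)))
    return nucleotide, qualifier_gis


# Abbreviated lines from the HUMCCLEC2 record shown above.
SAMPLE = """\
LOCUS       HUMCCLEC2    3362 bp    DNA             PRI       04-FEB-1999
ACCESSION   AF077345
VERSION     AF077345.1  GI:3386489
     source          1..3362
                     /db_xref="taxon:9606"
     CDS             join(AF077344.1:10965..11079,AF077344.1:16446..16529,
                     777..1171)
                     /protein_id="AAD12542.1"
                     /db_xref="GI:3386491"
"""

nucleotide, qualifier_gis = extract_gis(SAMPLE)
print(nucleotide, qualifier_gis)
```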

1.3.3 Introduction of the "CON" division

  A new and experimental file called gbcon.seq is included with GenBank Release 115.0.
This data file contains instructions for the assembly of larger-scale objects (eg,
"contigs", hence the new division's name) from individual GenBank records. It is unusual
in that the records in this file contain no sequence data at all. For an overview of this
experimental GenBank division, please see the following NCBI News article:

       http://www.ncbi.nlm.nih.gov/Web/Newsltr/Fall99/contig.html

  One class of data represented in gbcon.seq is derived from "segmented sets". Genomic
DNA sequences can sometimes be incomplete because submitters choose to sequence just the
exons, small portions of the intervening introns, and perhaps some upstream or downstream
control regions. These related sequences are labelled "1 of N", "2 of N", etc, via
the SEGMENT line in the GenBank flatfile format. There is often a coding region feature
on the last member, with a join() location pointing to exon intervals that contribute
to the coding sequence. Gaps of unknown size typically exist between the sequenced pieces.

  Entries in the new gbcon.seq file make the relationships among such pieces more explicit.
In addition, these entries have accession numbers, version numbers, and NCBI GIs, just like
regular GenBank records. Here's an example illustrating these points:

LOCUS       AH007743     7832 bp    DNA             CON       26-MAY-1999
DEFINITION  Gallus gallus ornithine transcarbamylase (OTC) gene, complete cds.
ACCESSION   AH007743
VERSION     AH007743.1  GI:4927367
KEYWORDS    .
SOURCE      chicken.
  ORGANISM  Gallus gallus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Archosauria;
            Aves; Neognathae; Galliformes; Phasianidae; Phasianinae; Gallus.
[....]
FEATURES             Location/Qualifiers
     source          1..7832
                     /organism="Gallus gallus"
                     /db_xref="taxon:9031"
                     /chromosome="1"
CONTIG      join(AF065630.1:1..1903,gap(),AF065631.1:1..435,gap(),
            AF065632.1:1..509,gap(),AF065633.1:1..722,gap(),AF065634.1:1..707,
            gap(),AF065635.1:1..836,gap(),AF065636.1:1..1614,gap(),
            AF065637.1:1..605,gap(),AF065638.1:1..501)
//
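
  The CONTIG line of the example above can be read mechanically. The
following sketch assumes that a CONTIG statement is a single join() whose
elements are either "accession.version:start..end" spans or gap() markers,
which is all that the example shows; the function name `parse_contig` and
the tuple layout are illustrative only.

```python
import re

# One span of a member record, e.g. AF065630.1:1..1903
SEGMENT_RE = re.compile(r"^(\S+?):(\d+)\.\.(\d+)$")


def parse_contig(contig_text):
    """Parse CONTIG join(...) text into ('segment', acc, start, end) and ('gap',) tuples."""
    # Collapse continuation lines, then drop the CONTIG keyword.
    joined = "".join(part.strip() for part in contig_text.splitlines())
    if joined.startswith("CONTIG"):
        joined = joined[len("CONTIG"):].strip()
    if not (joined.startswith("join(") and joined.endswith(")")):
        raise ValueError("expected a join(...) expression")
    body = joined[len("join("):-1]

    parts = []
    for item in body.split(","):
        item = item.strip()
        if item == "gap()":
            parts.append(("gap",))
            continue
        match = SEGMENT_RE.match(item)
        if match is None:
            raise ValueError("unrecognized CONTIG element: %r" % item)
        parts.append(("segment", match.group(1),
                      int(match.group(2)), int(match.group(3))))
    return parts


CONTIG_TEXT = """\
CONTIG      join(AF065630.1:1..1903,gap(),AF065631.1:1..435,gap(),
            AF065632.1:1..509,gap(),AF065633.1:1..722,gap(),AF065634.1:1..707,
            gap(),AF065635.1:1..836,gap(),AF065636.1:1..1614,gap(),
            AF065637.1:1..605,gap(),AF065638.1:1..501)
"""

parts = parse_contig(CONTIG_TEXT)
segments = [p for p in parts if p[0] == "segment"]
total_bases = sum(end - start + 1 for _, _, start, end in segments)
print(len(segments), "segments,", total_bases, "bases")
```

  Summing the spans gives the length reported on the LOCUS line of the
AH007743 example, so the gap() elements contribute no bases to that figure.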

  The second class of data represented in gbcon.seq results from splitting large complete
genomes (eg, bacterial genomes) into multiple pieces. GenBank has a single-record size
limit of 350,000 bases. Complete genomes larger than this are split (without
interrupting any coding regions), with a small overlap between each piece (about 60 bases
for bacterial genomes). The CON division entries for these genomes provide instructions
for the re-assembly of the complete genome from these separate pieces.
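
  As a rough illustration of the splitting described above, the sketch below
computes piece coordinates for a genome of a given length, using the 350,000
base limit and a 60 base overlap. It is a simplification under stated
assumptions: it uses fixed-size pieces and does not attempt to avoid
interrupting coding regions, which the real procedure does. The function name
`split_points` and its parameters are hypothetical.

```python
def split_points(genome_length, max_piece=350_000, overlap=60):
    """Return 1-based inclusive (start, end) pieces that cover the genome.

    Consecutive pieces share `overlap` bases. Coding-region boundaries are
    NOT taken into account in this simplified sketch.
    """
    if genome_length < 1:
        raise ValueError("genome_length must be positive")
    if not 0 <= overlap < max_piece:
        raise ValueError("overlap must be smaller than max_piece")

    pieces = []
    start = 1
    while True:
        end = min(start + max_piece - 1, genome_length)
        pieces.append((start, end))
        if end == genome_length:
            break
        # The next piece begins `overlap` bases before this one ends.
        start = end - overlap + 1
    return pieces


pieces = split_points(1_000_000)
for start, end in pieces:
    print(start, end, end - start + 1)
```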

  This new GenBank division is experimental, and thus some caveats apply to the new
data file:

  o gbcon.seq doesn't have a "header"
  o the content of gbcon.seq is not reflected in any of the index files (gb*.idx)
  o if all literature references for the members of a segmented set are
    associated with an incomplete span of the sequence, no references appear
    at all for the gbcon.seq entry
  o definition lines for segmented-set CON entries are sometimes less than ideal:
    - genus and species only
    - the content reflects one member of the segset, not the set as a whole
  o some complete bacterial genomes are not yet represented in gbcon.seq due
    to ongoing work with complete-genome dataflow

These problems will be addressed in the next two months. In the meantime, 
the gbcon.seq data file will give users a chance to experiment with this
new data representation. If problems other than those described above are
encountered, we'd like to hear of them. Please send your problem reports to
the NCBI Service Desk (see Section 6).

1.4 Upcoming Changes

1.4.1 Replacement of organelle-related qualifiers with /organelle
  
  A large variety of organelle qualifiers currently exist: /mitochondrion,
/chromoplast, /chloroplast, etc. Starting with GenBank Release 116.0 (February,
2000), these qualifiers will all be incorporated into a single new qualifier,
with a controlled value format. The preliminary description of this qualifier
is as follows:

Qualifier       /organelle=""

Definition      type of membrane-bound intracellular structure from
                which the sequence was obtained

Value format    mitochondrion, nucleomorph, plastid, mitochondrion:kinetoplast,
                plastid:chloroplast, plastid:apicoplast, plastid:chromoplast,
                plastid:cyanelle, plastid:leucoplast, plastid:proplastid

Examples        /organelle="mitochondrion"
                /organelle="nucleomorph"
                /organelle="plastid"
                /organelle="mitochondrion:kinetoplast"
                /organelle="plastid:chloroplast"
                /organelle="plastid:apicoplast"
                /organelle="plastid:chromoplast"
                /organelle="plastid:cyanelle"
                /organelle="plastid:leucoplast"
                /organelle="plastid:proplastid"

Comments        modifier text limited to values from controlled list
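
  Because the value is restricted to a controlled list, software reading
the new qualifier can validate it with a simple membership test. The sketch
below uses the ten values from the preliminary description above; since that
description is preliminary, the list may change. The function name
`parse_organelle` is illustrative.

```python
import re

# Controlled values from the preliminary qualifier description.
ORGANELLE_VALUES = frozenset({
    "mitochondrion",
    "nucleomorph",
    "plastid",
    "mitochondrion:kinetoplast",
    "plastid:chloroplast",
    "plastid:apicoplast",
    "plastid:chromoplast",
    "plastid:cyanelle",
    "plastid:leucoplast",
    "plastid:proplastid",
})

ORGANELLE_RE = re.compile(r'^/organelle="([^"]*)"$')


def parse_organelle(qualifier):
    """Return the value of an /organelle qualifier, or raise ValueError."""
    match = ORGANELLE_RE.match(qualifier.strip())
    if match is None:
        raise ValueError("not an /organelle qualifier: %r" % qualifier)
    value = match.group(1)
    if value not in ORGANELLE_VALUES:
        raise ValueError("value not in controlled list: %r" % value)
    return value


print(parse_organelle('/organelle="plastid:chloroplast"'))
```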

1.4.2 Mutation and Allele features to be discontinued

  Agreement was reached at the May 1999 collaborative DDBJ/EMBL/GenBank
meeting that the functionality provided by the variation, mutation, and
allele features could be represented by just a single feature, variation.
Submitters of sequence data are now being encouraged to use just the
variation feature. With GenBank Release 117.0, all existing mutation and
allele features will be converted to variation, and then mutation and
allele features will no longer be legal feature keys.
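
  Users who maintain local copies of flatfile records may want to anticipate
this conversion. The following sketch rewrites mutation and allele feature
keys to variation on feature-table lines. It assumes the column layout seen
in the sample records earlier in these notes (feature key starting in column
6, location starting in column 22), and it is not a description of how
GenBank itself will perform the conversion. The function name
`convert_feature_line` is hypothetical.

```python
LEGACY_KEYS = {"mutation", "allele"}


def convert_feature_line(line):
    """Rewrite a mutation/allele feature-key line to use the variation key.

    Lines that are not feature-key lines (qualifiers, continuation lines,
    other linetypes) are returned unchanged.
    """
    # Feature-key lines have exactly five leading spaces before the key.
    if not line.startswith(" " * 5) or line[5:6] in ("", " "):
        return line
    key, _, location = line[5:].partition(" ")
    if key not in LEGACY_KEYS:
        return line
    # Pad the key so the location again starts in column 22.
    return " " * 5 + "variation".ljust(16) + location.lstrip()


for sample in ("     mutation        123",
               "     allele          456..460",
               "     source          1..3362"):
    print(convert_feature_line(sample))
```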

1.4.3 New REFERENCE type for on-line journals

  Agreement was reached at the May 1999 collaborative DDBJ/EMBL/GenBank
meeting that an effort should be made to accommodate references which are
published only on-line. Until specifications for such references are
available from library organizations, GenBank will present them in a manner
like this:

	REFERENCE   1  (bases 1 to 2858)
	  AUTHORS   Smith, J.
	  TITLE     Cloning and expression of a phospholipase gene
	  JOURNAL   Online Publication
	  REMARK    Online-Journal-name; Article Identifier; URL

  This format is still tentative; additional information about this new
reference type will be made available via these release notes.

1.4.4 Selenocysteine representation

  Selenocysteine residues within the protein translations of coding
region features have been represented in GenBank via the letter 'X'
and a /transl_except qualifier. At the May collaborative meeting, it
was learned that IUPAC plans to adopt the letter 'U' for selenocysteine.

  DDBJ, EMBL, and GenBank will thus use this new amino acid abbreviation
for their /translation qualifiers. Although a timetable for its appearance
has not been decided upon, we are mentioning this now because the
introduction of a new residue abbreviation is a fairly fundamental change.
Details about the use of 'U' will be provided via these release notes as
they become available.

1.4.5 VRL division will be split into multiple files

  The viral GenBank division (gbvrl.seq) will soon be split into multiple
files, since its size is approaching 300MB. This is likely to occur by
GenBank Release 117.0 (April 2000). The resulting files for VRL will be
gbvrl1.seq and gbvrl2.seq.




