GenBank Release 115.0 is now available via ftp from the National Center
for Biotechnology Information:

Ftp Site          Directory  Contents
----------------  ---------  ---------------------------------------
ncbi.nlm.nih.gov  genbank    GenBank Release 115.0 flatfiles
                  ncbi-asn1  ASN.1 data used to create Release 115.0

Uncompressed, the Release 115.0 flatfiles require roughly 18917 MB
(sequence files only) or 22026 MB (including the 'index' files). The
ASN.1 version requires roughly 16693 MB. From the release notes:

Release  Date    Base Pairs  Entries
114      Oct 99  3841163011  4864570
115      Dec 99  4653932745  5354511

The 812 Mbp growth since Release 114.0 is the largest single-release
increase ever experienced by GenBank (the previous record was 441 Mbp).
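
The growth figure can be checked directly from the table. The following is a minimal sketch; the dictionary layout and the function name `growth` are illustrative only and are not part of any NCBI tool.

```python
# Figures copied from the release table above.
RELEASES = {
    "114": {"base_pairs": 3841163011, "entries": 4864570},
    "115": {"base_pairs": 4653932745, "entries": 5354511},
}


def growth(prev, curr):
    """Return (base-pair growth, entry growth) between two releases."""
    bp = RELEASES[curr]["base_pairs"] - RELEASES[prev]["base_pairs"]
    entries = RELEASES[curr]["entries"] - RELEASES[prev]["entries"]
    return bp, entries


bp_growth, entry_growth = growth("114", "115")
print(bp_growth // 10**6, "Mbp (whole megabases)")
print(entry_growth, "new entries")
```

The base-pair difference is 812,769,734, which corresponds to the 812 Mbp quoted above when truncated to whole megabases.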
Close-of-data was 12/10/99. Twelve days were required to prepare this
release. For additional information, see the README files in either of the
directories mentioned above, and the release notes (gbrel.txt) in the
genbank directory. Sections 1.3 and 1.4 of the release notes (Changes in
Release 115.0 and Upcoming Changes) have been appended below.
Sections 1.3.2 and 1.3.3 describe particularly important changes related
to NID/PID and the new gbcon.seq division, implemented with this release.
Release 115.0 data are currently available via NCBI's Entrez and Blast
servers, and the 'query' email server.
New GenBank cumulative update files (gbcu.flat.Z and gbcu.aso.Z), containing
only those entries new/updated since the Release 115.0 close-of-data, should be
available by 9:00am EST, December 22. Please note that the new CUs will be
smaller than previous versions you might have obtained after Release 114.0 was
posted.
If you encounter problems while ftp'ing or uncompressing Release 115.0,
please send email outlining your difficulties to info@ncbi.nlm.nih.gov.
Mark Cavanaugh & Vladimir Aleksey
GenBank
NCBI/NLM/NIH
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
1.3 Important Changes in Release 115.0
1.3.1 Organizational changes
Due to database growth, the following divisions are now being split into
multiple pieces:
  o PLN : three pieces
  o EST : forty-four pieces
  o HTG : seven pieces
  o GSS : fifteen pieces
1.3.2 Removal of obsolete NID linetype and PID /db_xref
When the Accession.Version system was introduced in the Spring of 1999, we
stated via the GenBank newsgroup that the NID and PID linetypes would
eventually be phased out:
"The NID linetype and the PID /db_xref qualifier will eventually be
removed from the GenBank flatfile format, probably by August 1999.
However, the NCBI GI identifiers that they contain will remain
available, via the new VERSION linetype and the new GI /db_xref
qualifier."
This removal of NID/PID has been implemented with GenBank Release
115.0 (December 1999). Excerpts from a GenBank record that illustrate
the format change are appended below.
NCBI "GI" sequence identifiers can still be obtained from GenBank flatfiles
via the VERSION linetype and the GI /db_xref qualifier of CDS features.
Previous GenBank flatfile view of the HUMCCLEC1 segmented set:
LOCUS       HUMCCLEC1   17079 bp    DNA             PRI       04-FEB-1999
DEFINITION  Homo sapiens cartilage-derived C-type lectin (CLECSF1) gene, exons
            1 and 2.
ACCESSION   AF077344 AF077343
NID         g3982778
VERSION     AF077344.1  GI:3982778
KEYWORDS    .
SEGMENT     1 of 2
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
            Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 17079)
  AUTHORS   Neame,P.J. and Tapp,H.
  TITLE     The Cartilage-Derived, C-type Lectin (CLECSF1); Structure of the
            Human Gene and Chromosomal Location
  JOURNAL   Unpublished
....
FEATURES             Location/Qualifiers
     source          1..17079
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="16"
                     /map="16q23"
                     /tissue_type="cartilage"
     repeat_region   complement(180..393)
                     /rpt_family="MIR"
....
//
LOCUS       HUMCCLEC2    3362 bp    DNA             PRI       04-FEB-1999
DEFINITION  Homo sapiens cartilage-derived C-type lectin (CLECSF1) gene, exon 3
            and complete cds.
ACCESSION   AF077345
NID         g3386489
VERSION     AF077345.1  GI:3386489
KEYWORDS    .
SEGMENT     2 of 2
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
            Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 3362)
  AUTHORS   Neame,P.J. and Tapp,H.
  TITLE     The Cartilage-Derived, C-type Lectin (CLECSF1); Structure of the
            Human Gene and Chromosomal Location
  JOURNAL   Unpublished
....
FEATURES             Location/Qualifiers
     source          1..3362
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="16"
                     /map="16q23"
                     /tissue_type="cartilage"
     mRNA            join(AF077344.1:10886..11079,AF077344.1:16446..16529,
                     777..1171)
                     /gene="CLECSF1"
                     /product="cartilage-derived C-type lectin"
     gene            order(AF077344.1:10886..17079,1..3362)
                     /gene="CLECSF1"
     CDS             join(AF077344.1:10965..11079,AF077344.1:16446..16529,
                     777..1171)
                     /gene="CLECSF1"
                     /note="similar to tetranectin"
                     /codon_start=1
                     /product="cartilage-derived C-type lectin"
                     /protein_id="AAD12542.1"
                     /db_xref="PID:g3386491"
                     /db_xref="GI:3386491"
                     /translation="MAKNGLVICILVITLLLDQTTSHTSRLKARKHSKRRVRDKDGDL
                     KTQIEKLWTEVNALKEIQALQTVCLRGTKVHKKCYLASEGLKHFHEANEDCISKGGIL
                     VIPRNSDEINALQDYGKRSLPGVNDFWLGINDMVTEGKFVDVNGIAISFLNWDRAQPN
                     GGKRENCVLFSQSAQGKWSDEACRSSKRYICEFTIPQ"
....
//
GenBank flatfile view of the HUMCCLEC1 segmented set as of December 1999.
Note the lack of the NID linetype and the PID /db_xref :
LOCUS       HUMCCLEC1   17079 bp    DNA             PRI       04-FEB-1999
DEFINITION  Homo sapiens cartilage-derived C-type lectin (CLECSF1) gene, exons
            1 and 2.
ACCESSION   AF077344 AF077343
VERSION     AF077344.1  GI:3982778
KEYWORDS    .
SEGMENT     1 of 2
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
            Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 17079)
  AUTHORS   Neame,P.J. and Tapp,H.
  TITLE     The Cartilage-Derived, C-type Lectin (CLECSF1); Structure of the
            Human Gene and Chromosomal Location
  JOURNAL   Unpublished
....
FEATURES             Location/Qualifiers
     source          1..17079
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="16"
                     /map="16q23"
                     /tissue_type="cartilage"
     repeat_region   complement(180..393)
                     /rpt_family="MIR"
....
//
LOCUS       HUMCCLEC2    3362 bp    DNA             PRI       04-FEB-1999
DEFINITION  Homo sapiens cartilage-derived C-type lectin (CLECSF1) gene, exon 3
            and complete cds.
ACCESSION   AF077345
VERSION     AF077345.1  GI:3386489
KEYWORDS    .
SEGMENT     2 of 2
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
            Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 3362)
  AUTHORS   Neame,P.J. and Tapp,H.
  TITLE     The Cartilage-Derived, C-type Lectin (CLECSF1); Structure of the
            Human Gene and Chromosomal Location
  JOURNAL   Unpublished
....
FEATURES             Location/Qualifiers
     source          1..3362
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="16"
                     /map="16q23"
                     /tissue_type="cartilage"
     mRNA            join(AF077344.1:10886..11079,AF077344.1:16446..16529,
                     777..1171)
                     /gene="CLECSF1"
                     /product="cartilage-derived C-type lectin"
     gene            order(AF077344.1:10886..17079,1..3362)
                     /gene="CLECSF1"
     CDS             join(AF077344.1:10965..11079,AF077344.1:16446..16529,
                     777..1171)
                     /gene="CLECSF1"
                     /note="similar to tetranectin"
                     /codon_start=1
                     /product="cartilage-derived C-type lectin"
                     /protein_id="AAD12542.1"
                     /db_xref="GI:3386491"
                     /translation="MAKNGLVICILVITLLLDQTTSHTSRLKARKHSKRRVRDKDGDL
                     KTQIEKLWTEVNALKEIQALQTVCLRGTKVHKKCYLASEGLKHFHEANEDCISKGGIL
                     VIPRNSDEINALQDYGKRSLPGVNDFWLGINDMVTEGKFVDVNGIAISFLNWDRAQPN
                     GGKRENCVLFSQSAQGKWSDEACRSSKRYICEFTIPQ"
....
//
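
Software that previously read GI numbers from the NID linetype or the PID /db_xref can obtain them from the VERSION linetype and the GI /db_xref instead. The following is a minimal sketch, not an NCBI-supplied parser: the function name `extract_gis` and the regular expressions are assumptions made for illustration, and the sample text is abridged from the HUMCCLEC2 record above.

```python
import re

# VERSION line: accession, version number, then the nucleotide GI.
VERSION_RE = re.compile(r"^VERSION\s+([A-Z0-9_]+)\.(\d+)\s+GI:(\d+)")
# GI cross-reference on a feature (e.g. the protein GI on a CDS).
GI_XREF_RE = re.compile(r'^/db_xref="GI:(\d+)"')


def extract_gis(record_text):
    """Return (accession, version, nucleotide_gi, feature_gis) for one record."""
    accession = version = nucleotide_gi = None
    feature_gis = []
    for raw_line in record_text.splitlines():
        line = raw_line.strip()
        m = VERSION_RE.match(line)
        if m:
            accession = m.group(1)
            version = int(m.group(2))
            nucleotide_gi = int(m.group(3))
            continue
        m = GI_XREF_RE.match(line)
        if m:
            feature_gis.append(int(m.group(1)))
    return accession, version, nucleotide_gi, feature_gis


SAMPLE = """\
LOCUS       HUMCCLEC2    3362 bp    DNA             PRI       04-FEB-1999
ACCESSION   AF077345
VERSION     AF077345.1  GI:3386489
FEATURES             Location/Qualifiers
     source          1..3362
                     /db_xref="taxon:9606"
     CDS             join(AF077344.1:10965..11079,AF077344.1:16446..16529,
                     777..1171)
                     /protein_id="AAD12542.1"
                     /db_xref="GI:3386491"
//
"""

print(extract_gis(SAMPLE))
```

Because the sketch matches only `GI:` cross-references, it behaves the same on pre-115.0 records that still carry the PID /db_xref; those lines are simply ignored.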
1.3.3 Introduction of the "CON" division
A new and experimental file called gbcon.seq is included with GenBank Release 115.0.
This data file contains instructions for the assembly of larger-scale objects (e.g.,
"contigs", hence the new division's name) from individual GenBank records. It is unusual
in that the records in this file contain no sequence data at all. For an overview of this
experimental GenBank division, please see the following NCBI News article:
http://www.ncbi.nlm.nih.gov/Web/Newsltr/Fall99/contig.html
One class of data represented in gbcon.seq is derived from "segmented sets". Genomic
DNA sequences can sometimes be incomplete because submitters choose to sequence just the
exons, small portions of the intervening introns, and perhaps some upstream or downstream
control regions. These related sequences are labelled "1 of N", "2 of N", etc, via
the SEGMENT line in the GenBank flatfile format. There is often a coding region feature
on the last member, with a join() location pointing to exon intervals that contribute
to the coding sequence. Gaps of unknown size typically exist between the sequenced pieces.
Entries in the new gbcon.seq file make the relationships among such pieces more explicit.
In addition, these entries have accession numbers, version numbers, and NCBI GIs, just like
regular GenBank records. Here's an example illustrating these points:
LOCUS       AH007743     7832 bp    DNA             CON       26-MAY-1999
DEFINITION  Gallus gallus ornithine transcarbamylase (OTC) gene, complete cds.
ACCESSION   AH007743
VERSION     AH007743.1  GI:4927367
KEYWORDS    .
SOURCE      chicken.
  ORGANISM  Gallus gallus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Archosauria;
            Aves; Neognathae; Galliformes; Phasianidae; Phasianinae; Gallus.
[....]
FEATURES             Location/Qualifiers
     source          1..7832
                     /organism="Gallus gallus"
                     /db_xref="taxon:9031"
                     /chromosome="1"
CONTIG      join(AF065630.1:1..1903,gap(),AF065631.1:1..435,gap(),
            AF065632.1:1..509,gap(),AF065633.1:1..722,gap(),AF065634.1:1..707,
            gap(),AF065635.1:1..836,gap(),AF065636.1:1..1614,gap(),
            AF065637.1:1..605,gap(),AF065638.1:1..501)
//
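
The CONTIG line in this example can be read mechanically. The sketch below is illustrative only: the names `parse_contig` and `sequenced_length` are hypothetical, and the parser handles just the two element types that appear in the example above (Accession.Version intervals and `gap()`), not every location form that might occur in gbcon.seq.

```python
import re

# One sequenced piece, e.g. "AF065630.1:1..1903".
SEGMENT_RE = re.compile(r"([A-Z]+\d+\.\d+):(\d+)\.\.(\d+)")


def parse_contig(join_text):
    """Parse a CONTIG join(...) expression.

    Returns a list whose items are (accession.version, start, stop)
    tuples for sequenced pieces and None for each gap() of unknown size.
    """
    text = "".join(join_text.split())  # drop line breaks and indentation
    if not (text.startswith("join(") and text.endswith(")")):
        raise ValueError("expected a join(...) expression")
    parts = []
    for token in text[len("join("):-1].split(","):
        if token == "gap()":
            parts.append(None)
            continue
        m = SEGMENT_RE.fullmatch(token)
        if m is None:
            raise ValueError(f"unrecognized CONTIG element: {token!r}")
        parts.append((m.group(1), int(m.group(2)), int(m.group(3))))
    return parts


def sequenced_length(parts):
    """Sum the lengths of the sequenced pieces; gaps contribute nothing."""
    total = 0
    for part in parts:
        if part is not None:
            _, start, stop = part
            total += stop - start + 1
    return total


CONTIG = """join(AF065630.1:1..1903,gap(),AF065631.1:1..435,gap(),
AF065632.1:1..509,gap(),AF065633.1:1..722,gap(),AF065634.1:1..707,
gap(),AF065635.1:1..836,gap(),AF065636.1:1..1614,gap(),
AF065637.1:1..605,gap(),AF065638.1:1..501)"""

parts = parse_contig(CONTIG)
print(sequenced_length(parts))  # 7832
```

The nine intervals sum to 7832 bases, which matches the length on the LOCUS line of AH007743: in this record the gaps of unknown size add nothing to the stated length.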
The second class of data represented in gbcon.seq results from splitting large complete
genomes (e.g., bacterial genomes) into multiple pieces. GenBank has a single-record size
limit of 350,000 bases. Complete genomes larger than this are split (without
interrupting any coding regions), with a small overlap between adjacent pieces (about 60 bases
for bacterial genomes). The CON division entries for these genomes provide instructions
for the re-assembly of the complete genome from these separate pieces.
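
To make the split-and-reassemble idea concrete, here is a simplified sketch. It makes two assumptions that do not hold for real data: pieces are cut at fixed offsets (the actual splitting avoids interrupting coding regions), and the overlap is a constant (the release notes say only "about 60 bases"). A real program should follow the explicit intervals given in the CON entry rather than assume an overlap. The function names are hypothetical.

```python
import random


def split_with_overlap(seq, max_len, overlap):
    """Cut seq into pieces of at most max_len bases.

    Each piece after the first starts `overlap` bases before the end
    of the previous piece.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and below max_len")
    pieces = []
    start = 0
    step = max_len - overlap
    while True:
        pieces.append(seq[start:start + max_len])
        if start + max_len >= len(seq):
            break
        start += step
    return pieces


def reassemble(pieces, overlap):
    """Rebuild the sequence: keep the first piece whole, then append each
    later piece minus its leading `overlap` bases."""
    if not pieces:
        return ""
    return pieces[0] + "".join(piece[overlap:] for piece in pieces[1:])


# A made-up 1,000,000-base "genome" with a fixed seed for repeatability.
rng = random.Random(0)
genome = "".join(rng.choices("ACGT", k=1_000_000))

pieces = split_with_overlap(genome, max_len=350_000, overlap=60)
print(len(pieces), [len(p) for p in pieces])
print(reassemble(pieces, overlap=60) == genome)
```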
This new GenBank division is experimental, and thus some caveats apply to the new
data file:
  o gbcon.seq doesn't have a "header"
  o the content of gbcon.seq is not reflected in any of the index files (gb*.idx)
  o if all literature references for the members of a segmented set are
    associated with an incomplete span of the sequence, no references appear
    at all for the gbcon.seq entry
  o definition lines for segmented-set CON entries are sometimes less than ideal:
      - genus and species only
      - the content reflects one member of the segset, not the set as a whole
  o some complete bacterial genomes are not yet represented in gbcon.seq due
    to ongoing work with complete-genome dataflow
These problems will be addressed in the next two months. In the meantime,
the gbcon.seq data file will give users a chance to experiment with this
new data representation. If problems other than those described above are
encountered, we'd like to hear of them. Please send your problem reports to
the NCBI Service Desk (see Section 6).
1.4 Upcoming Changes
1.4.1 Replacement of organelle-related qualifiers with /organelle
A large variety of organelle qualifiers currently exist: /mitochondrion,
/chromoplast, /chloroplast, etc. Starting with GenBank Release 116.0 (February,
2000), these qualifiers will all be incorporated into a single new qualifier,
with a controlled value format. The preliminary description of this qualifier
is as follows:
Qualifier       /organelle=""
Definition      type of membrane-bound intracellular structure from
                which the sequence was obtained
Value format    mitochondrion, nucleomorph, plastid, mitochondrion:kinetoplast,
                plastid:chloroplast, plastid:apicoplast, plastid:chromoplast,
                plastid:cyanelle, plastid:leucoplast, plastid:proplastid
Examples        /organelle="mitochondrion"
                /organelle="nucleomorph"
                /organelle="plastid"
                /organelle="mitochondrion:kinetoplast"
                /organelle="plastid:chloroplast"
                /organelle="plastid:apicoplast"
                /organelle="plastid:chromoplast"
                /organelle="plastid:cyanelle"
                /organelle="plastid:leucoplast"
                /organelle="plastid:proplastid"
Comments        modifier text limited to values from controlled list
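
Since the value is restricted to a controlled list, a parser can validate it directly. The sketch below is based on the preliminary description above, so the list may change before Release 116.0; the function name `parse_organelle_qualifier` is hypothetical.

```python
import re

# Controlled values copied from the preliminary description above.
ORGANELLE_VALUES = frozenset({
    "mitochondrion",
    "nucleomorph",
    "plastid",
    "mitochondrion:kinetoplast",
    "plastid:chloroplast",
    "plastid:apicoplast",
    "plastid:chromoplast",
    "plastid:cyanelle",
    "plastid:leucoplast",
    "plastid:proplastid",
})

ORGANELLE_RE = re.compile(r'^/organelle="([^"]*)"$')


def parse_organelle_qualifier(text):
    """Parse one /organelle qualifier into (organelle, subtype).

    subtype is None when the value has no ':' modifier. Raises
    ValueError for malformed text or values outside the list.
    """
    m = ORGANELLE_RE.match(text.strip())
    if m is None:
        raise ValueError(f"not an /organelle qualifier: {text!r}")
    value = m.group(1)
    if value not in ORGANELLE_VALUES:
        raise ValueError(f"value not in controlled list: {value!r}")
    organelle, _, subtype = value.partition(":")
    return organelle, subtype or None


print(parse_organelle_qualifier('/organelle="plastid:chloroplast"'))
print(parse_organelle_qualifier('/organelle="mitochondrion"'))
```

Note that a bare `chloroplast` is rejected under this description; it is valid only in the form `plastid:chloroplast`.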
1.4.2 Mutation and Allele features to be discontinued
Agreement was reached at the May 1999 collaborative DDBJ/EMBL/GenBank
meeting that the functionality provided by the variation, mutation, and
allele features could be represented by just a single feature, variation.
Submitters of sequence data are now being encouraged to use just the
variation feature. With GenBank Release 117.0, all existing mutation and
allele features will be converted to variation, and then mutation and
allele features will no longer be legal feature keys.
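
Users who maintain local copies of older flatfiles may want to apply the same conversion themselves. The following sketch rewrites feature key lines and leaves everything else untouched. It assumes the usual flatfile feature-table layout (key starting in column 6, location starting in column 22); `convert_feature_key` is a hypothetical helper, not an NCBI utility.

```python
# Feature keys to be merged into "variation", per the text above.
MERGED_KEYS = {"mutation": "variation", "allele": "variation"}

KEY_START = 5   # 0-based index of column 6
LOC_START = 21  # 0-based index of column 22


def convert_feature_key(line):
    """Return the line with a mutation/allele feature key renamed.

    Only lines that look like feature key lines (five leading spaces,
    then a key) are considered; qualifier lines and other linetypes
    are returned unchanged.
    """
    if len(line) > LOC_START and line.startswith("     ") and line[KEY_START] != " ":
        key = line[KEY_START:LOC_START].strip()
        new_key = MERGED_KEYS.get(key)
        if new_key is not None:
            width = LOC_START - KEY_START
            return " " * KEY_START + new_key.ljust(width) + line[LOC_START:]
    return line


# Build sample lines programmatically so the columns are exact.
old_line = " " * KEY_START + "mutation".ljust(16) + "1234"
qualifier = " " * LOC_START + '/note="allele of interest"'

print(convert_feature_key(old_line))
print(convert_feature_key(qualifier))
```

Matching on the key column, rather than on the word anywhere in the line, keeps qualifier text that merely mentions "allele" or "mutation" from being altered.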
1.4.3 New REFERENCE type for on-line journals
Agreement was reached at the May 1999 collaborative DDBJ/EMBL/GenBank
meeting that an effort should be made to accommodate references which are
published only on-line. Until specifications for such references are
available from library organizations, GenBank will present them in a manner
like this:
REFERENCE   1  (bases 1 to 2858)
  AUTHORS   Smith, J.
  TITLE     Cloning and expression of a phospholipase gene
  JOURNAL   Online Publication
  REMARK    Online-Journal-name; Article Identifier; URL
This format is still tentative; additional information about this new
reference type will be made available via these release notes.
1.4.4 Selenocysteine representation
Selenocysteine residues within the protein translations of coding
region features have been represented in GenBank via the letter 'X'
and a /transl_except qualifier. At the May collaborative meeting, it
was learned that IUPAC plans to adopt the letter 'U' for selenocysteine.
DDBJ, EMBL, and GenBank will thus use this new amino acid abbreviation
for their /translation qualifiers. Although a timetable for its appearance
has not been decided upon, we are mentioning this now because the
introduction of a new residue abbreviation is a fairly fundamental change.
Details about the use of 'U' will be provided in these release
notes as they become available.
1.4.5 VRL division will be split into multiple files
The viral GenBank division (gbvrl.seq) will soon be split into multiple
files, since its size is approaching 300 MB. This is likely to occur by
GenBank Release 117.0 (April 2000). The resulting files for VRL will be
gbvrl1.seq and gbvrl2.seq.