FTP Products for TPA Sequences
cavanaug at ncbi.nlm.nih.gov
Thu Jan 30 17:52:05 EST 2003
Greetings GenBank Users,
As described in the GenBank 133.0 release notes:
a new class of sequence data is now being collected by GenBank, EMBL, and
DDBJ: Third-Party Annotation (TPA) data. Enclosed below is a slightly
updated version of the TPA announcement. More information about the TPA
effort can be found at the NCBI website:
Starting Friday, January 31, TPA update products will be made available
at the NCBI FTP site. Daily, incremental update files for all new/updated
TPA records will be located in:
The TPA updates will have filename prefixes of:
Filename suffixes for these updates will be:
.bbs : binary Bioseq-set (ASN.1)
.gbff : GenBank flatfile
.gnp : GenPept flatfile
.fsa_nt : Nucleotide FASTA
.fsa_aa : Protein FASTA
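
As a rough illustration of how the suffix list above might be used when
sorting downloaded update files, the Python sketch below maps each suffix
to its format. The function name tpa_format and the filename in the usage
note are hypothetical; the actual filename prefixes are not reproduced here.

```python
# Suffix-to-format table, copied from the list in this announcement.
TPA_SUFFIXES = {
    ".bbs": "binary Bioseq-set (ASN.1)",
    ".gbff": "GenBank flatfile",
    ".gnp": "GenPept flatfile",
    ".fsa_nt": "Nucleotide FASTA",
    ".fsa_aa": "Protein FASTA",
}


def tpa_format(filename):
    """Return the format description for a TPA update file, or None.

    Hypothetical helper: it only inspects the end of the filename and
    makes no assumption about the filename prefix.
    """
    for suffix, description in TPA_SUFFIXES.items():
        if filename.endswith(suffix):
            return description
    return None
```

For instance, tpa_format("example.gbff") yields "GenBank flatfile", and a
name with an unlisted suffix yields None.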
We do not expect to generate complete releases (similar to GenBank
releases) for TPA until the volume of TPA records has substantially
increased. Until that time, a set of cumulative TPA update files
containing all TPA records will be made available in:
The cumulative TPA update files will have filename prefixes of:
They will utilize the same filename suffixes that are listed above.
NOTE: The cumulative TPA products will be *discontinued* once TPA
releases are being built.
Initially, the TPA records included in these update files will be
limited to those submitted to GenBank. EMBL and DDBJ TPAs will be
added no later than Friday, February 7, 2003.
READMEs for the TPA directories will also be installed by that date.
The Third-Party Annotation Data Collection
Pursuant to agreements made at their 2002 Collaborative Meeting,
DDBJ/EMBL/GenBank have undertaken the collection of a new class of
sequence data: Third-Party Annotation (TPA).
The TPA data collection will complement the existing DDBJ/EMBL/GenBank
comprehensive database of primary nucleotide sequences, which typically
result from direct sequencing of cDNAs, ESTs, genomic DNAs, etc.
'Primary data' are defined to be data for which the submitting group has
done the sequencing and annotation, and hence, as owner of the data,
has privileges to update/correct the associated sequence records.
In contrast, non-primary (TPA) sequences are defined as sequences which:
a) consist exclusively of sequence data from one, or several,
previously-existing primary entries owned by other groups, or
b) consist of a mixture of previously-existing primary entries,
some owned by the TPA submitter and the rest by one or more other groups.
TPA categories and requirements
Users can submit new annotation of single sequences or assemblies
of sequences that are owned by other groups to the TPA data
The primary sequences must be available in the DDBJ/EMBL/GenBank
databases, and submitters to the TPA database must provide the
accession numbers of the primary sequences in their TPA submission.
TPA sequences based on primary data available only in proprietary
databases are not accepted.
Some examples of data submissions accepted for TPA include:
1. analysis and re-annotation of DDBJ/EMBL/GenBank sequences
owned by other groups
2. gap-filling, in which a TPA submitter might utilize HTG or
EST data to complete an otherwise incomplete sequence
3. TPA sequences based on NCBI/Ensembl trace archive data
4. TPA sequences based on Whole Genome Shotgun (WGS) sequences
Sequences based on primary data from multiple organisms are not accepted.
Sequences will not be accepted for TPA in lieu of an update to
primary records. A submitter who owns a primary record is expected
to update that record as new sequence is determined, or sequencing
ambiguities/errors are resolved.
Any newly-determined sequence data that is to be part of a TPA
record must first be submitted as a new primary sequence to
DDBJ/EMBL/GenBank.
The TPA dataset is intended to present sequence data and annotation
in support of actual biological discoveries that are published in
the scientific literature, without requiring that the sequence be
determined by the authors/submitters.
To ensure that the sequence annotation is of high quality, TPA records
must be associated with a study published in a peer-reviewed journal
before the data are released to the public.
TPA records include a mandatory 'PRIMARY' block, which documents the
relationships between spans of the TPA sequence and the primary
(non-TPA) sequences that contributed to it. The elements of the
PRIMARY block are:
  a) TPA_SPAN            base span on the TPA sequence
  b) PRIMARY_IDENTIFIER  acc.version of the contributing sequence(s)
  c) PRIMARY_SPAN        base span on the contributing primary sequence
  d) COMP                'c' indicates that the contributing sequence
                         originates from the complementary strand of
                         the primary sequence entry
  TPA_SPAN   PRIMARY_IDENTIFIER   PRIMARY_SPAN   COMP
  1-426      AC004528.1           18665-19090
  427-526    AC001234.2           1-100          c
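
For readers who want to process PRIMARY rows programmatically, here is a
minimal Python sketch that splits each row into the four elements listed
above. It assumes whitespace-separated columns exactly as shown in this
announcement; the layout in actual flatfiles may differ. The names
PrimaryRow, parse_span and parse_primary_rows are hypothetical and not
part of any NCBI tool.

```python
from typing import NamedTuple


class PrimaryRow(NamedTuple):
    tpa_start: int       # start of the span on the TPA sequence
    tpa_end: int         # end of the span on the TPA sequence
    primary_id: str      # acc.version of the contributing sequence
    primary_start: int   # start of the span on the primary sequence
    primary_end: int     # end of the span on the primary sequence
    complement: bool     # True when the COMP column holds 'c'


def parse_span(text):
    """Split a 'start-end' base span into two integers."""
    start, end = text.split("-")
    return int(start), int(end)


def parse_primary_rows(lines):
    """Parse PRIMARY rows, skipping blank lines and the column header."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or fields[0] == "TPA_SPAN":
            continue
        tpa_start, tpa_end = parse_span(fields[0])
        primary_start, primary_end = parse_span(fields[2])
        complement = len(fields) > 3 and fields[3] == "c"
        rows.append(PrimaryRow(tpa_start, tpa_end, fields[1],
                               primary_start, primary_end, complement))
    return rows


# The example rows from this announcement.
example = [
    "TPA_SPAN   PRIMARY_IDENTIFIER   PRIMARY_SPAN   COMP",
    "1-426      AC004528.1           18665-19090",
    "427-526    AC001234.2           1-100          c",
]
rows = parse_primary_rows(example)
```

With the two example rows, rows[1].complement is True and rows[0].primary_id
is "AC004528.1".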