Greetings GenBank Users,
For several months, lists of all nucleotide and protein accession
numbers that are live and public within the GenBank database have
been generated on a weekly basis and installed in the genbank/livelists
directory at the NCBI FTP site. However, this new data product has not
been formally announced until now.
An excerpt from the README.genbank file that describes the lists in more
detail is enclosed below. The full file is available at:
ftp://ncbi.nlm.nih.gov/genbank/README.genbank
If you still have questions about the accession number lists after reading
the excerpt, please send them to the NCBI Service Desk:
info at ncbi.nlm.nih.gov
Mark Cavanaugh
GenBank
NCBI/NLM/NIH
===============================================================================
FTP Site: ncbi.nlm.nih.gov
Directory: genbank/livelists
URL: ftp://ncbi.nlm.nih.gov/genbank/livelists/
This directory contains lists, generated weekly on Sunday evening at
approximately 6:00pm EST/EDT, of all nucleotide and protein accessions
in GenBank. File names for these lists are of the form:
GbAccList.MMDD.YYYY.Z
where MM represents a two-digit value for the month, DD represents a two-digit
value for the day, and YYYY represents a four-digit value for the year.
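As an illustration of the naming scheme, here is a minimal Python sketch that recovers the generation date from a file name. The function name livelist_date and the sample file name in the usage note are hypothetical, chosen only for this example; they are not part of any NCBI tool.

```python
import re
from datetime import date

# Matches names of the form GbAccList.MMDD.YYYY.Z
LIVELIST_NAME = re.compile(r"GbAccList\.(\d{2})(\d{2})\.(\d{4})\.Z")

def livelist_date(filename):
    """Return the date encoded in a livelist file name, or None if it does not fit."""
    match = LIVELIST_NAME.fullmatch(filename)
    if match is None:
        return None
    month, day, year = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits were present but do not form a real calendar date.
        return None
```

For a hypothetical file named GbAccList.1017.1999.Z, livelist_date returns the date 17 October 1999. Names that do not follow the pattern yield None rather than raising an error.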
These files have been compressed with the Unix compress command, hence the
".Z" suffix.
Each line of these lists contains three comma-delimited values: accession
number, sequence version number, and NCBI GI identifier. Protein accessions
can be easily distinguished from nucleotide accessions because they have a
three-letter prefix, followed by five digits. The remaining accessions are
nucleotide accessions, in either a one-letter/five-digit format or a
two-letter/six-digit format.
Here's an example from the accession list for AF093062 and its protein
translation AAC64372:
AF093062,2,6019463
AAC64372,2,6019464
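A minimal Python sketch of how such lines might be parsed and classified, based only on the accession formats described above. The names LiveListEntry and parse_entry are hypothetical, and accessions that fit none of the three described formats are labeled "unknown" rather than guessed at.

```python
import re
from typing import NamedTuple

class LiveListEntry(NamedTuple):
    accession: str
    version: int
    gi: int
    molecule: str  # "protein", "nucleotide", or "unknown"

# Three letters followed by five digits
PROTEIN = re.compile(r"[A-Z]{3}\d{5}")
# One letter and five digits, or two letters and six digits
NUCLEOTIDE = re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")

def parse_entry(line):
    """Split one 'accession,version,gi' line and classify the accession."""
    accession, version, gi = line.strip().split(",")
    if PROTEIN.fullmatch(accession):
        molecule = "protein"
    elif NUCLEOTIDE.fullmatch(accession):
        molecule = "nucleotide"
    else:
        molecule = "unknown"
    return LiveListEntry(accession, int(version), int(gi), molecule)
```

Applied to the two lines above, parse_entry classifies AF093062 as a nucleotide accession and AAC64372 as a protein accession.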
In the GenBank flatfile representation of AF093062, these fields can be
found on the VERSION line and in the /protein_id and /db_xref qualifiers of
the coding region feature:
LOCUS       AF093062     2795 bp    DNA             INV       12-OCT-1999
DEFINITION  Leishmania major polyadenylate-binding protein 1 (PAB1) gene,
            complete cds.
ACCESSION   AF093062
VERSION     AF093062.2  GI:6019463
....
     CDS             263..1945
                     /gene="PAB1"
                     /note="polyA-binding protein"
                     /codon_start=1
                     /product="polyadenylate-binding protein 1"
                     /protein_id="AAC64372.2"
                     /db_xref="GI:6019464"
                     /translation="MAAAVQEAAAPVAHQPQMDKPIEIASIYVGDLDATINEPQ....
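To show how the flatfile fields map back to list entries, here is a rough Python sketch that pulls the same three values out of flatfile text with regular expressions. The function name livelist_lines is hypothetical. The sketch assumes that the /db_xref="GI:..." qualifier directly follows /protein_id, as it does in the excerpt above; records laid out differently would need a real flatfile parser.

```python
import re

# VERSION line: accession.version followed by GI:number
VERSION_LINE = re.compile(r"^VERSION\s+(\w+)\.(\d+)\s+GI:(\d+)", re.MULTILINE)
# Assumption: the GI /db_xref immediately follows /protein_id, as in the excerpt
PROTEIN_ID = re.compile(r'/protein_id="(\w+)\.(\d+)"\s*/db_xref="GI:(\d+)"')

def livelist_lines(flatfile_text):
    """Return 'accession,version,gi' strings found in a flatfile excerpt."""
    lines = []
    for pattern in (VERSION_LINE, PROTEIN_ID):
        for accession, version, gi in pattern.findall(flatfile_text):
            lines.append(f"{accession},{version},{gi}")
    return lines

excerpt = '''VERSION     AF093062.2  GI:6019463
     CDS             263..1945
                     /protein_id="AAC64372.2"
                     /db_xref="GI:6019464"
'''
for line in livelist_lines(excerpt):
    print(line)
```

Run on the shortened excerpt, this prints the same two lines shown in the accession list example: AF093062,2,6019463 and AAC64372,2,6019464.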
In the ASN.1 representation of AF093062, these fields can be found within
the Bioseq.id chain of the nucleotide and protein bioseqs:
seq {
  id {
    genbank {
      name "AF093062" ,
      accession "AF093062" ,
      version 2 } ,
    gi 6019463 } ,
....
seq {
  id {
    genbank {
      accession "AAC64372" ,
      version 2 } ,
    gi 6019464 } ,