Weekly GenBank Accession Lists

Mark Cavanaugh cavanaug at lagrange.nlm.nih.gov
Tue Nov 16 12:10:22 EST 1999


Greetings GenBank Users,

For several months, lists of all nucleotide and protein accession
numbers that are live and public within the GenBank database have
been generated on a weekly basis and installed in the genbank/livelists
directory at the NCBI ftp site. However, this new data product hasn't
been formally announced (until now).

The README.genbank file:

	ftp://ncbi.nlm.nih.gov/genbank/README.genbank

describes the lists in more detail, and the relevant excerpt is enclosed
below. If you still have questions about the accession number lists
after reading it, please send them to the NCBI Service Desk:

	info at ncbi.nlm.nih.gov

Mark Cavanaugh
GenBank
NCBI/NLM/NIH

===============================================================================

FTP Site:  ncbi.nlm.nih.gov
Directory: genbank/livelists
URL:       ftp://ncbi.nlm.nih.gov/genbank/livelists/

  This directory contains lists, generated weekly on Sunday evening at
approximately 6:00pm EST/EDT, of all nucleotide and protein accessions
in GenBank. File names for these lists are of the form:

	GbAccList.MMDD.YYYY.Z

where MM represents a two-digit value for the month, DD represents a two-digit
value for the day, and YYYY represents a four-digit value for the year.
These files have been compressed with the Unix compress command, hence the
".Z" suffix.

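  As an illustration of the naming scheme above, here is a minimal Python
sketch that builds and parses such file names. The function names
livelist_name and livelist_date are hypothetical, and the date in the usage
note is only an example, not a statement that a list for that day exists.

```python
import datetime
import re

# File names are described above as GbAccList.MMDD.YYYY.Z
_NAME_RE = re.compile(r"^GbAccList\.(\d{2})(\d{2})\.(\d{4})\.Z$")


def livelist_name(day: datetime.date) -> str:
    """Build the file name for the list generated on the given day."""
    return f"GbAccList.{day:%m%d}.{day:%Y}.Z"


def livelist_date(name: str) -> datetime.date:
    """Recover the generation date from a file name of the form above."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a GbAccList file name: {name!r}")
    month, day, year = (int(part) for part in match.groups())
    return datetime.date(year, month, day)
```

  For example, livelist_name(datetime.date(1999, 11, 14)) yields
"GbAccList.1114.1999.Z", and livelist_date applied to that string returns
the same date.
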
  Each line of these lists contains three comma-delimited values: accession
number, sequence version number, and NCBI GI identifier. Protein accessions
can be easily distinguished from nucleotide accessions because they have a
three-letter prefix, followed by five digits. The remaining accessions are
nucleotide accessions, in either a one-letter/five-digit format or a
two-letter/six-digit format.

  Here's an example from the accession list for AF093062 and its protein
translation AAC64372:

	AF093062,2,6019463
	AAC64372,2,6019464
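
  The line format and the prefix rules described above can be sketched in
Python as follows. This is only an illustration: the names AccessionRecord,
classify, and parse_line are hypothetical, and the patterns cover just the
three accession formats mentioned in this description.

```python
import re
from typing import NamedTuple


class AccessionRecord(NamedTuple):
    accession: str
    version: int
    gi: int
    kind: str  # "protein", "nucleotide", or "unknown"


# Protein: three letters followed by five digits.
_PROTEIN_RE = re.compile(r"^[A-Z]{3}\d{5}$")
# Nucleotide: one letter + five digits, or two letters + six digits.
_NUCLEOTIDE_RE = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")


def classify(accession: str) -> str:
    """Classify an accession by the formats described in the text."""
    if _PROTEIN_RE.match(accession):
        return "protein"
    if _NUCLEOTIDE_RE.match(accession):
        return "nucleotide"
    return "unknown"


def parse_line(line: str) -> AccessionRecord:
    """Split one 'accession,version,gi' line into typed fields."""
    accession, version, gi = line.strip().split(",")
    return AccessionRecord(accession, int(version), int(gi), classify(accession))
```

  Applied to the two example lines, parse_line reports AF093062 as a
nucleotide accession and AAC64372 as a protein accession, each with
version 2.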

  In the GenBank flatfile representation of AF093062, these fields can be
found on the VERSION line and in the /protein_id and /db_xref qualifiers of
the coding region feature:

LOCUS       AF093062     2795 bp    DNA             INV       12-OCT-1999
DEFINITION  Leishmania major polyadenylate-binding protein 1 (PAB1) gene,
            complete cds.
ACCESSION   AF093062
VERSION     AF093062.2  GI:6019463
....
     CDS             263..1945
                     /gene="PAB1"
                     /note="polyA-binding protein"
                     /codon_start=1
                     /product="polyadenylate-binding protein 1"
                     /protein_id="AAC64372.2"
                     /db_xref="GI:6019464"
                     /translation="MAAAVQEAAAPVAHQPQMDKPIEIASIYVGDLDATINEPQ....
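
  As a rough sketch of how the list fields map onto the flatfile, the
following Python pulls them out with regular expressions. It assumes the
/db_xref="GI:..." qualifier directly follows /protein_id, as in the excerpt
above; the function name flatfile_to_list_lines is hypothetical, and this
is not a general flatfile parser.

```python
import re
from typing import List

# VERSION line: accession.version followed by GI:number
_VERSION_RE = re.compile(r"^VERSION\s+(\w+)\.(\d+)\s+GI:(\d+)", re.MULTILINE)
# /protein_id qualifier immediately followed by a GI /db_xref (assumption)
_PROTEIN_RE = re.compile(r'/protein_id="(\w+)\.(\d+)"\s+/db_xref="GI:(\d+)"')


def flatfile_to_list_lines(text: str) -> List[str]:
    """Return 'accession,version,gi' strings found in a flatfile record."""
    lines = []
    for pattern in (_VERSION_RE, _PROTEIN_RE):
        for accession, version, gi in pattern.findall(text):
            lines.append(f"{accession},{version},{gi}")
    return lines
```

  Run on the excerpt above, this should produce the same two lines shown in
the accession list example.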

  In the ASN.1 representation of AF093062, these fields can be found within
the Bioseq.id chain of the nucleotide and protein bioseqs:

    seq {
      id {
        genbank {
          name "AF093062" ,
          accession "AF093062" ,
          version 2 } ,
        gi 6019463 } ,
	....
    seq {
      id {
        genbank {
          accession "AAC64372" ,
          version 2 } ,
        gi 6019464 } ,
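
  The same fields can be picked out of the text ASN.1 shown above with a
simple pattern match. This is a sketch that assumes the id chain is laid
out exactly as in the excerpt (accession, then version, then gi); the
function name asn1_to_list_lines is hypothetical, and a real ASN.1 parser
would be more robust.

```python
import re
from typing import List

# accession "X" , version N } , gi M   (layout as in the excerpt; assumption)
_ID_RE = re.compile(
    r'accession\s+"(\w+)"\s*,\s*version\s+(\d+)\s*\}\s*,\s*gi\s+(\d+)'
)


def asn1_to_list_lines(text: str) -> List[str]:
    """Return 'accession,version,gi' strings found in text ASN.1."""
    return [f"{acc},{ver},{gi}" for acc, ver, gi in _ID_RE.findall(text)]
```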




