[Genbank-bb] Protein FASTA files for GenBank releases : new location and file convention

Cavanaugh, Mark (NIH/NLM/NCBI) [E] via genbankb@net.bio.net (by cavanaug from ncbi.nlm.nih.gov)
Fri Feb 15 12:34:02 EST 2008


Prior to February 2008, the FASTA product for protein sequences from
coding regions annotated on the DNA sequences in GenBank was
provided as a single large file:

	ftp://ftp.ncbi.nih.gov/genbank/relNNN.fsa_aa.gz

where 'NNN' represents a 3-digit GenBank release number.

The uncompressed size of this file has grown to exceed 4 GB,
which is unmanageable for many users. As of GenBank Release
164.0, individual protein FASTA files will therefore be provided
on a per-division basis, in a new subdirectory of the NCBI FTP site:

	ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta

One such file would be named:

	gbpri1.fsa_aa.gz
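
For scripts that fetch these files, the old and new locations can be
composed as in the following minimal Python sketch. The function names
are hypothetical, and the only per-division file name known from this
announcement is gbpri1.fsa_aa.gz; other file names should be taken from
the directory listing or the README rather than guessed.

```python
# Base locations taken from this announcement.
OLD_BASE = "ftp://ftp.ncbi.nih.gov/genbank"
NEW_BASE = "ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta"


def old_release_url(release: int) -> str:
    """URL of the old single-file product, relNNN.fsa_aa.gz (hypothetical helper)."""
    if not 0 <= release <= 999:
        raise ValueError("release number must fit in three digits")
    # 'NNN' is the 3-digit GenBank release number, zero-padded if needed.
    return f"{OLD_BASE}/rel{release:03d}.fsa_aa.gz"


def division_file_url(file_name: str) -> str:
    """URL of one per-division file in the new directory (hypothetical helper)."""
    # The file name is passed in as-is; this sketch does not assume a
    # naming pattern beyond the single example given in the announcement.
    return f"{NEW_BASE}/{file_name}"


print(old_release_url(164))
print(division_file_url("gbpri1.fsa_aa.gz"))
```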

Further information will be available via a README file in the
new ncbi-asn1/protein_fasta directory, when the Release 164.0
files are installed (possibly by Saturday February 16).

Note that the location of the protein FASTA files is within the
/ncbi-asn1 area, not the /genbank area. Since the protein FASTA
files have a 1-to-1 correspondence with NCBI's ASN.1 files, this
is a more natural location for them.

[In fact, the quality-score data files currently located
 in /genbank/quality_scores would *also* be located more 
 naturally in the /ncbi-asn1 area. They may be relocated
 at a future date.]

The old single-file protein FASTA product will be supported
for two more GenBank releases, through Release 166.0 in June
of 2008. But after that release, the relNNN.fsa_aa.gz file
will be discontinued.
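
The transition schedule described above can be summarized in a short
Python sketch. The function name and the product labels are
hypothetical, and the release numbers reflect the plan as announced
here, which could change.

```python
FIRST_PER_DIVISION_RELEASE = 164  # per-division files start with Release 164.0
LAST_SINGLE_FILE_RELEASE = 166    # relNNN.fsa_aa.gz supported through Release 166.0


def available_products(release: int) -> list[str]:
    """Protein FASTA products expected for a given GenBank release number."""
    products = []
    if release <= LAST_SINGLE_FILE_RELEASE:
        products.append("single-file")    # /genbank/relNNN.fsa_aa.gz
    if release >= FIRST_PER_DIVISION_RELEASE:
        products.append("per-division")   # /ncbi-asn1/protein_fasta/*.fsa_aa.gz
    return products


for rel in (163, 164, 166, 167):
    print(rel, available_products(rel))
```

Releases 164 through 166 are the overlap period in which both products
exist, which gives users three releases to switch to the new files.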

Mark Cavanaugh
GenBank
NCBI/NLM/NIH/HHS


