[Genbank-bb] Protein FASTA files for GenBank releases: new location and file convention
Cavanaugh, Mark (NIH/NLM/NCBI) [E]
(by cavanaug from ncbi.nlm.nih.gov)
Fri Feb 15 12:34:02 EST 2008
Prior to February 2008, a FASTA product for protein sequences from
coding regions annotated on the DNA sequences in GenBank was
provided as a single large file:
where 'NNN' represents a 3-digit GenBank release number.
The uncompressed size of this file has grown to exceed 4GB,
which is unmanageable for many users. So as of GenBank Release
164.0, individual protein FASTA files will be provided on
a per-division basis, in a new subdirectory of the NCBI FTP site:
One such file would be named:
Further information will be available via a README file in the
new ncbi-asn1/protein_fasta directory, when the Release 164.0
files are installed (possibly by Saturday, February 16).
Note that the location of the protein FASTA files is within the
/ncbi-asn1 area, not the /genbank area. Since the protein FASTA
files have a 1-to-1 correspondence with NCBI's ASN.1 files, this
is a more natural location for them.
[In fact, the quality-score data files currently located
in /genbank/quality_scores would *also* be located more
naturally in the /ncbi-asn1 area. They may be relocated
at a future date.]
The old single-file protein FASTA product will be supported
for two more GenBank releases, through Release 166.0 in June
of 2008. But after that release, the relNNN.fsa_aa.gz file
will be discontinued.
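
For scripts that still fetch the old single-file product, the naming
rule and support window described above could be sketched roughly as
follows. This is only an illustration: the function names and the
constant are hypothetical, not part of any NCBI tool, and no FTP path
is assumed.

```python
# Hypothetical helpers; names are illustrative, not an NCBI API.

# Last release for which the single-file product is to be provided,
# per this announcement (Release 166.0, June 2008).
LAST_LEGACY_RELEASE = 166


def legacy_protein_fasta_name(release: int) -> str:
    """Build the old single-file name, relNNN.fsa_aa.gz.

    'NNN' is the 3-digit GenBank release number.
    """
    if not 100 <= release <= 999:
        raise ValueError("expected a 3-digit GenBank release number")
    return f"rel{release}.fsa_aa.gz"


def legacy_file_provided(release: int) -> bool:
    """True if the single-file product is still expected for this release."""
    return release <= LAST_LEGACY_RELEASE


if __name__ == "__main__":
    for rel in (164, 166, 167):
        status = "provided" if legacy_file_provided(rel) else "discontinued"
        print(legacy_protein_fasta_name(rel), status)
```

A script written this way would need to switch to the per-division
files for any release after 166.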