[Genbank-bb] Release 163.0 Problem : Duplicate protein sequences in FASTA companion file

Cavanaugh, Mark (NIH/NLM/NCBI) [E] via genbankb@net.bio.net (by cavanaug from ncbi.nlm.nih.gov)
Thu Jan 24 14:01:52 EST 2008


Dear GenBank Users,

Due to a processing error, 179,546 protein sequences were represented
twice in the protein FASTA file that accompanies GenBank Release 163.0:

        ftp://ftp.ncbi.nih.gov/genbank/rel163.fsa_aa.gz

On Thursday, January 24, at approximately 1:55 pm EST, the file was
replaced with a new version from which the duplicate protein sequences
have been removed.

The file sizes and timestamps of the original and repaired files are:

  -r--r--r--   1 cavanaug gbproces 1905172189 Dec 22 16:22 rel163.fsa_aa.gz
  -r--r--r--   1 cavanaug gbproces 1870005097 Jan 24 13:55 rel163.fsa_aa.gz
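
To tell which version of the file a local mirror holds, the byte counts
above can be compared against the size on disk. The following is a minimal
sketch, not an NCBI tool; the function names and the local path in the usage
note are hypothetical, and only the two sizes come from the listing above.

```python
import os

# Byte counts taken from the directory listing in this announcement.
ORIGINAL_SIZE = 1905172189  # Dec 22 file, contains the duplicates
REPAIRED_SIZE = 1870005097  # Jan 24 file, duplicates removed


def classify_by_size(size_in_bytes):
    """Return 'original', 'repaired', or 'unknown' for a given file size."""
    if size_in_bytes == REPAIRED_SIZE:
        return "repaired"
    if size_in_bytes == ORIGINAL_SIZE:
        return "original"
    return "unknown"


def classify_file(path):
    """Classify a local copy of rel163.fsa_aa.gz by its size on disk."""
    return classify_by_size(os.path.getsize(path))
```

For example, classify_file("rel163.fsa_aa.gz") returning "original" would
suggest that the copy predates the repair and should be downloaded again.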

Our thanks to V. Martin at INRA for reporting this problem to the NCBI
Service Desk ( info from ncbi.nlm.nih.gov ). The duplication was revealed
during an attempt to build a BLAST database from the FASTA file using
formatdb. These messages were present in the formatdb log file:

        Closing volume genpept.01 with 2033856 sequences, 499,999,786 letters
        (.psq file = 502033908 bytes; .phr file = 242472809
        NIsam key file genpept.01.pnd not in sorted order!
        unsorted or non-unique elements:#3318, #3319 : 154883, 154883
        ERROR: [000.000] Failed to create index.  Possibly a gi included more
        than once in the database.
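
Duplicates of this kind can also be found before running formatdb by
scanning the FASTA definition lines directly. The following is a minimal
sketch under stated assumptions: it treats the first whitespace-delimited
token after ">" as the sequence identifier (for example "gi|...|gb|...|"),
it holds all identifiers in memory, and its function names are hypothetical.
It is not how formatdb itself detects the problem.

```python
import gzip
from collections import Counter


def defline_ids(lines):
    """Yield the identifier token from each FASTA definition line."""
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split(None, 1)
            if parts:  # skip a bare ">" with no identifier
                yield parts[0]


def duplicated_ids(lines):
    """Return {identifier: count} for identifiers seen more than once."""
    counts = Counter(defline_ids(lines))
    return {seq_id: n for seq_id, n in counts.items() if n > 1}


def duplicated_ids_in_gzip(path):
    """Scan a gzip-compressed FASTA file for repeated identifiers."""
    with gzip.open(path, "rt", encoding="ascii", errors="replace") as handle:
        return duplicated_ids(handle)
```

Calling duplicated_ids_in_gzip("rel163.fsa_aa.gz") on the repaired file
would be expected to return an empty dictionary if no identifier is
repeated.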

Procedural changes in the generation of files for the CON division of
GenBank led to the duplication. The underlying cause has been identified
and fixed.

My apologies for any inconvenience that this error may have caused.

Mark Cavanaugh
GenBank
NCBI/NLM/NIH/HHS


