From cavanaug from ncbi.nlm.nih.gov Thu Jan 24 14:01:52 2008
From: cavanaug from ncbi.nlm.nih.gov (Cavanaugh, Mark (NIH/NLM/NCBI) [E])
Date: Thu Jan 24 14:02:02 2008
Subject: [Genbank-bb] Release 163.0 Problem : Duplicate protein sequences in FASTA companion file
Message-ID: <7F40ACD22B0A23448C4E8755E5875FE70FA91551@NIHCESMLBX8.nih.gov>

Dear GenBank Users,

Due to a processing error, 179,546 protein sequences were represented
twice in the protein FASTA file that accompanies GenBank Release 163.0 :

  ftp://ftp.ncbi.nih.gov/genbank/rel163.fsa_aa.gz

On Thursday January 24 at approximately 1:55pm EST, the file was replaced
with a new version, from which the duplicate protein sequences have been
removed.

The file sizes and timestamps of the original and repaired files are:

  -r--r--r--   1 cavanaug gbproces 1905172189 Dec 22 16:22 rel163.fsa_aa.gz
  -r--r--r--   1 cavanaug gbproces 1870005097 Jan 24 13:55 rel163.fsa_aa.gz

Our thanks to V. Martin at INRA for reporting this problem to the NCBI
Service Desk ( info@ncbi.nlm.nih.gov ). It was revealed during an attempt
to build a BLAST database from the FASTA file using formatdb. These
messages were present in the formatdb log file:

  Closing volume genpept.01 with 2033856 sequences, 499,999,786 letters
  (.psq file = 502033908 bytes; .phr file = 242472809
  NIsam key file genpept.01.pnd not in sorted order!
  unsorted or non-unique elements:#3318, #3319 : 154883, 154883
  ERROR: [000.000] Failed to create index.
  Possibly a gi included more than once in the database.

Procedural changes in the generation of files for the CON division of
GenBank led to the duplication. The underlying cause has been identified
and fixed.

My apologies for any inconvenience that this error may have caused.

Mark Cavanaugh
GenBank
NCBI/NLM/NIH/HHS
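
For readers who want to check a local copy of a FASTA file for this kind of duplication before running formatdb, the following is a minimal sketch and not part of NCBI's tooling. It assumes that the identifier is the first whitespace-delimited token on each definition line (for example, a token beginning with `gi|`), and the function names `duplicate_fasta_ids` and `duplicate_ids_in_gzip` are hypothetical. It counts identifiers only; it does not compare the sequence data itself.

```python
import gzip
from collections import Counter
from typing import Dict, Iterable


def duplicate_fasta_ids(lines: Iterable[str]) -> Dict[str, int]:
    """Return identifiers that occur on more than one FASTA definition line.

    Assumption: the identifier is the first whitespace-delimited token
    after the leading '>' of each definition line.
    """
    counts: Counter = Counter()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split(None, 1)
            if fields:  # skip a bare '>' with no identifier
                counts[fields[0]] += 1
    # Keep only identifiers seen at least twice.
    return {seq_id: n for seq_id, n in counts.items() if n > 1}


def duplicate_ids_in_gzip(path: str) -> Dict[str, int]:
    """Run the same check on a gzip-compressed FASTA file, read as text."""
    with gzip.open(path, "rt") as handle:
        return duplicate_fasta_ids(handle)
```

Calling `duplicate_ids_in_gzip("rel163.fsa_aa.gz")` on a downloaded copy would return a dictionary mapping each repeated identifier to its count; an empty dictionary suggests no identifier is repeated. The sketch holds every identifier in memory, which may be substantial for a file with millions of sequences.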