From cavanaug from ncbi.nlm.nih.gov Thu Jun 11 16:45:19 2009
From: cavanaug from ncbi.nlm.nih.gov (Cavanaugh, Mark (NIH/NLM/NCBI) [E])
Date: Fri Jun 12 07:30:49 2009
Subject: [Genbank-bb] GenBank 172.0 Close-Of-Data
Message-ID: <7B6F170840CA6C4DA63EE0C8A7BB43EC06CB4263@NIHCESMLBX15.nih.gov>

Greetings GenBank Users,

Close-of-data for the upcoming GenBank Release 172.0 occurred on Wednesday June 10 2009 at approximately 1:30am EDT. The subsequently generated GenBank Incremental Update files nc0610.aso, nc0610.flat, etc. contain data through the close.

Note: Release processing often does not begin until sometime during business hours on the close date. As a result, a number of sequence records processed *after* 1:30am are likely to be present in the GenBank 172.0 release files, even though they are "post-close". Similarly, the first GenBank Incremental Update that is generated after the close date is likely to contain a number of sequence records that are unchanged, compared to their appearance in the release files.

We expect to make the GenBank 172.0 data files available sometime tomorrow.

Our apologies for the lack of advance notice about the close date.

Mark Cavanaugh
GenBank
NCBI/NLM/NIH/HHS

From cavanaug from ncbi.nlm.nih.gov Fri Jun 12 14:41:53 2009
From: cavanaug from ncbi.nlm.nih.gov (Cavanaugh, Mark (NIH/NLM/NCBI) [E])
Date: Fri Jun 12 14:43:46 2009
Subject: [Genbank-bb] GenBank Release 172.0 Now Available
Message-ID: <7B6F170840CA6C4DA63EE0C8A7BB43EC06CB4457@NIHCESMLBX15.nih.gov>

Greetings GenBank Users,

GenBank Release 172.0 is now available via FTP from the National Center for Biotechnology Information (NCBI):

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 172.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 172.0

Close-of-data for GenBank 172.0 occurred on 06/10/2009.
Uncompressed, the Release 172.0 flatfiles require roughly 403 GB (sequence files only) or 431 GB (including the 'short directory', 'index' and the *.txt files). The ASN.1 data require approximately 366 GB.

Recent statistics for non-WGS, non-CON sequences:

  Release  Date       Base Pairs     Entries
  171      Apr 2009   102980268709   103335421
  172      Jun 2009   105277306080   106073709

Recent statistics for WGS sequences:

  Release  Date       Base Pairs     Entries
  171      Apr 2009   144522542010   48948309
  172      Jun 2009   145959997864   49063546

During the 60 days between the close dates for GenBank Releases 171.0 and 172.0, the non-WGS/non-CON portion of GenBank grew by 2,297,037,371 basepairs and by 2,738,288 sequence records. During that same period, 3,680,844 records were updated. An average of about 106,985 non-WGS/non-CON records were added and/or updated per day.

Between releases 171.0 and 172.0, the WGS component of GenBank grew by 1,437,455,854 basepairs and by 115,237 sequence records.

For additional release information, see the README files in either of the directories mentioned above, and the release notes (gbrel.txt) in the genbank directory. Sections 1.3 and 1.4 of the release notes (Changes in Release 172.0 and Upcoming Changes) have been appended below for your convenience.

** Important Notes **

* GenBank 'index' files are now provided without any EST content, and without most GSS content. See Section 1.3.4 of the release notes for further details. NCBI is considering ceasing support for the index files, so we encourage affected users to review that section and provide feedback.

Release 172.0 data, and subsequent updates, are available now via NCBI's Entrez and Blast services.

As a general guideline, we suggest first transferring the GenBank release notes (gbrel.txt) whenever a release is being obtained. Check to make sure that the date and release number in the header of the release notes are current (e.g., June 15 2009, 172.0).
If they are not, interrupt the remaining transfers and then request assistance from the NCBI Service Desk.

A comprehensive check of the headers of all release files after your transfers are complete is also suggested. Here's how one might go about this on a unix platform, using csh/tcsh:

        set files = `ls gb*.*`
        foreach i ($files)
                head -10 $i | grep Release
        end

Or, if the files are compressed, perhaps:

        gzcat $i | head -10 | grep Release

If you encounter problems while ftp'ing or uncompressing Release 172.0, please send email outlining your difficulties to:

        info@ncbi.nlm.nih.gov

Mark Cavanaugh, Michael Kimelman, Ilya Dondoshansky, Sergey Zhdanov
GenBank
NCBI/NLM/NIH/HHS

1.3 Important Changes in Release 172.0

1.3.1 PROJECT linetype has been replaced by DBLINK

The DBLINK linetype was introduced as of the February 2009 GenBank Release 170.0, to accommodate links to Project IDs and the NCBI Trace Assembly Archive, and new types of links that will arise in the future.

DBLINK co-existed with its predecessor linetype (PROJECT) for GenBank releases 170.0 and 171.0. With Release 172.0, however, the PROJECT line has been completely removed, as this record illustrates:

LOCUS       CP000964             5641239 bp    DNA     circular BCT 24-SEP-2008
DEFINITION  Klebsiella pneumoniae 342, complete genome.
ACCESSION   CP000964
VERSION     CP000964.1  GI:206564770
DBLINK      Project:28471

1.3.2 Organizational changes

The total number of sequence data files increased by 36 with this release:

  - the BCT division is now composed of  45 files (+5)
  - the ENV division is now composed of  16 files (+3)
  - the EST division is now composed of 875 files (+15)
  - the GSS division is now composed of 337 files (+2)
  - the INV division is now composed of  18 files (+3)
  - the PAT division is now composed of  73 files (+6)
  - the PLN division is now composed of  39 files (+1)
  - the VRL division is now composed of  12 files (+1)

The total number of 'index' files increased by 2 with this release:

  - the JOU (journal) index is now composed of 7 files (+1)
  - the KEY (keyword) index is now composed of 4 files (+1)

1.3.3 File header problem for EST and GSS files

A new method of generating the EST and GSS sequence files has been developed, which has reduced the time required to generate a GenBank release by one day. However, a minor problem in the formatting of the header of the sequence files was inadvertently introduced: a leading space exists before the filename on the very first line. For example:

 GBGSS100.SEQ          Genetic Sequence Data Bank
                          June 15 2009

It should be:

GBGSS100.SEQ          Genetic Sequence Data Bank
                          June 15 2009

The problem affects all EST files and most GSS files. We had hoped to repair this formatting issue for Release 172.0, but the code changes just missed the cut-off for release generation. The problem will definitely be resolved for Release 173.0.

1.3.4 Changes in the content of index files

As described in the GB 153 release notes, the 'index' files which accompany GenBank releases (see Section 3.3) are considered to be a legacy data product by NCBI, generated mostly for historical reasons. FTP statistics of January 2005 seem to support this: the index files were transferred only half as frequently as the files of sequence records.
The inherent inefficiencies of the index file format also lead us to suspect that they have little serious use by the user community, particularly for EST and GSS records.

The software that generated the index file products received little attention over the years, and finally reached its limitations in February 2006 (Release 152.0). The required multi-server queries which obtained and sorted many millions of rows of terms from several different databases simply outgrew the capacity of the hardware used for GenBank Release generation.

Our short-term solution is to cease generating some index-file content for all EST sequence records, and for GSS sequence records that originate via direct submission to NCBI.

The three gbacc*.idx index files continue to reflect the entirety of the release, including all EST and GSS records; however, the file contents are unsorted.

These 'solutions' are really just stop-gaps, and we will likely pursue one of two options:

  a) Cease support of the 'index' file products altogether.

  b) Provide new products that present some of the most useful data from the legacy 'index' files, and cease support for other types of index data.

If you are a user of the 'index' files associated with GenBank releases, we encourage you to make your wishes known, either via the GenBank newsgroup, or via email to NCBI's Service Desk:

        info@ncbi.nlm.nih.gov

Our apologies for any inconvenience that these changes may cause.

1.3.5 GSS File Header Problem

GSS sequences at GenBank are maintained in two different systems, depending on their origin, and the dumps from those systems occur in parallel. Because the second dump (for example) has no prior knowledge of exactly how many GSS files will be dumped by the first, it does not know how to number its own output files.

There is thus a discrepancy between the filenames and file headers for seventy-two of the GSS flatfiles in Release 172.0.
Consider gbgss266.seq:

GBGSS1.SEQ          Genetic Sequence Data Bank
                          June 15 2009

                NCBI-GenBank Flat File Release 172.0

                        GSS Sequences (Part 1)

   87198 loci,    64267715 bases, from    87198 reported sequences

Here, the filename and part number in the header are "1", though the file has been renamed as "266" based on the number of files dumped from the other system.

We hope to resolve this discrepancy at some point, but the priority is certainly much lower than many other tasks.

1.4 Upcoming Changes

1.4.1 Qualifier changes from INSDC 2009

Several qualifier changes for the Feature Table were agreed to at the annual INSDC meeting in May 2009. Complete details and implementation timelines will be made available in the August GenBank Release Notes. In the meantime, here is an early preview of the changes that were approved:

New value for /exception:

  /exception="annotated by transcript or proteomic data"

/pseudo qualifier to be re-named as /non_functional

  Because the term "pseudo" is often equated with "pseudogene", the /pseudo qualifier will be renamed as /non_functional, to better reflect its actual usage.

New /haplogroup qualifier defined for the source feature

From cavanaug from ncbi.nlm.nih.gov Tue Jun 16 12:24:05 2009
From: cavanaug from ncbi.nlm.nih.gov (Cavanaugh, Mark (NIH/NLM/NCBI) [E])
Date: Tue Jun 16 12:57:26 2009
Subject: [Genbank-bb] GenBank Updates : Three day outage for GenBank Incremental Updates : 0613-0615
Message-ID: <7B6F170840CA6C4DA63EE0C8A7BB43EC06CB488D@NIHCESMLBX15.nih.gov>

Due to an error in a configuration file, attempts to generate the 0613, 0614, and 0615 GenBank Incremental Update (GIU) products failed.

This problem was resolved on Tuesday June 16 at approximately 1:00pm EDT, and a set of 0616 data files was made available at the NCBI FTP site at approximately 1:19pm EDT.

The 0616 GIU contains all GenBank records new/modified since 1:33am EDT on June 12.

Our apologies for the inconvenience that this outage caused.

Mark Cavanaugh
GenBank
NCBI/NLM/NIH/HHS
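
The header check suggested in the Release 172.0 announcement (head -10 piped to grep Release) can also be scripted so that it reports the two known header quirks from Sections 1.3.3 and 1.3.5 separately from a genuinely stale file. What follows is a minimal Python sketch, not an NCBI tool. The function names read_header and check_header are hypothetical, and the sketch assumes that the filename is the first token on the first header line and that the release number appears as "Release NNN.N" somewhere in the first ten lines, as in the gbgss266.seq example.

```python
import gzip
import itertools
import re

# Matches e.g. "Release 172.0" anywhere in a header line.
RELEASE_RE = re.compile(r"Release\s+(\d+\.\d+)")


def read_header(path, n=10):
    """Return the first n lines of a flat file, plain or gzip-compressed."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="ascii", errors="replace") as fh:
        return [line.rstrip("\n") for line in itertools.islice(fh, n)]


def check_header(disk_name, header_lines, expected_release):
    """Return a list of problems found in one file header (empty if none)."""
    problems = []
    # Compare against the uncompressed name, e.g. gbgss100.seq.gz -> gbgss100.seq
    disk = disk_name[:-3] if disk_name.endswith(".gz") else disk_name

    first = header_lines[0] if header_lines else ""
    # Section 1.3.3: a stray space may precede the filename on line one.
    if first[:1].isspace():
        problems.append("leading whitespace before filename")

    # Section 1.3.5: the header may carry a different part number
    # than the name the file was given on disk.
    tokens = first.split()
    header_name = tokens[0] if tokens else ""
    if header_name.lower() != disk.lower():
        problems.append(f"header names {header_name} but file is {disk}")

    # The release number is the check that indicates a stale transfer.
    releases = [m.group(1) for line in header_lines
                for m in RELEASE_RE.finditer(line)]
    if expected_release not in releases:
        problems.append(f"release {expected_release} not found in header")
    return problems
```

A caller might run check_header(name, read_header(name), "172.0") over each gb*.seq file and treat only the "release ... not found" message as a reason to re-transfer. The other two messages correspond to the formatting issues already described in the release notes.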