[Maize] RefGen_v3 status update

Lawrence, Carolyn via maize@net.bio.net (by Carolyn.Lawrence from ARS.USDA.GOV)
Tue May 22 16:46:00 EST 2012

Greetings, maize researchers.

Please see below to learn about the status of RefGen_v3. Thanks to the Ware group (USDA-ARS and CSHL) for providing this information!

Carolyn Lawrence

Maize B73 Reference Assembly Update & Release

The Maize Genome Sequencing Project is preparing to release a new version of the maize B73 reference, designated B73 RefGen_v3. The new reference assembly improves on the current version, RefGen_v2, primarily through the inclusion of new genic regions, generated from a whole genome shotgun assembly of 454 sequences, which fill gaps in the current BAC clone-based assembly. Annotation of the amended assembly results in the definition of several hundred new and/or improved gene models. This document describes the schedule and protocol for public release of the new assembly.

The Project will release RefGen_v3 via maizesequence.org, maizegdb.org, and gramene.org once the following data sets have been submitted to and accepted by the International Nucleotide Sequence Database Collaboration (INSDC), also known as DDBJ/EMBL/GenBank:

1)    B73 RefGen_v2 pseudomolecule scaffold sequence and AGP (short for "A Golden Path", the table that specifies how the component contigs are combined to build the pseudomolecule scaffold sequences).

2)    The 454 whole genome shotgun assembly that serves as components of the new assembly.

3)    B73 RefGen_v3 pseudomolecule scaffold sequence and AGP.
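
As an illustration of what an AGP table encodes, here is a minimal Python sketch that reads AGP rows and stitches component sequences into a pseudomolecule. It assumes the tab-delimited column layout of the NCBI AGP specification; the function names, the contig IDs (ctgA, ctgB), and the object name chrT are hypothetical and are not taken from the RefGen files.

```python
# Sketch only: assumes AGP columns are
#   object, object_beg, object_end, part_number, component_type,
#   then (component_id, component_beg, component_end, orientation) for sequence rows
#   or   (gap_length, gap_type, linkage, linkage_evidence) for gap rows (type N or U).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_objects(agp_lines, components):
    """Assemble each AGP object (e.g. a pseudomolecule) from its components.

    agp_lines  -- iterable of tab-delimited AGP rows, in part order
    components -- dict mapping component ID to its sequence
    """
    pieces = {}
    for line in agp_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        cols = line.rstrip("\n").split("\t")
        obj, comp_type = cols[0], cols[4]
        if comp_type in ("N", "U"):
            # Gap row: represent the gap as a run of N of the stated length.
            piece = "N" * int(cols[5])
        else:
            comp_id, beg, end, orient = cols[5], int(cols[6]), int(cols[7]), cols[8]
            # AGP coordinates are 1-based and inclusive.
            piece = components[comp_id][beg - 1:end]
            if orient == "-":
                piece = reverse_complement(piece)
        pieces.setdefault(obj, []).append(piece)
    return {name: "".join(parts) for name, parts in pieces.items()}


# Hypothetical toy data, for illustration only.
EXAMPLE_COMPONENTS = {"ctgA": "ACGTACGT", "ctgB": "GGGCCC"}
EXAMPLE_AGP = [
    "chrT\t1\t4\t1\tW\tctgA\t1\t4\t+",
    "chrT\t5\t7\t2\tN\t3\tscaffold\tyes\tpaired-ends",
    "chrT\t8\t10\t3\tW\tctgB\t1\t3\t-",
]

if __name__ == "__main__":
    print(build_objects(EXAMPLE_AGP, EXAMPLE_COMPONENTS))
    # prints {'chrT': 'ACGTNNNCCC'}
```

Because the AGP fully determines the pseudomolecule once the components are accessioned, the same table can be used both to build the sequence and to check it.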

Best Practices for Supporting Reference Genomes:
Data providers, including single-organism community databases, multi-organism browsers, and NCBI, have struggled in recent years to maintain standardized and consistent representations of genome data within a given species. The existence of disparate sequence data, coordinate systems, and identifiers harms the scientific community by preventing interoperability and fracturing the research literature. Forcing researchers to reconcile such differences hampers scientific progress. These problems have prompted new policies among data providers that require INSDC submission as a prerequisite for hosting genome data. One example is the Browser Genome Release Agreement between the Ensembl, NCBI, and UCSC groups.

In addition to providing a unified source of data, submission to the INSDC ensures legitimacy of the assembly by application of rigorous standards. The vetting process includes, among other aspects:

1)    Ensuring that component contigs are already accessioned in DDBJ/EMBL/GenBank.

2)    Screening of component contigs for non-target organism contamination.

3)    Validating appropriate positioning and classification of gaps.

4)    Ensuring that AGP specification agrees with pseudomolecule sequence.

5)    Using standardized formatting for accurate representation of all data and metadata.
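
Checks 3 and 4 in the list above can be sketched in a few lines of Python. This is not the INSDC validator, only a rough illustration of the kind of consistency test involved: that AGP rows tile each object without holes or overlaps, that each row's span matches its component or gap length, that gap regions are N in the sequence, and that the AGP ends where the sequence ends. The function name and the toy data are hypothetical.

```python
def check_agp(agp_lines, pseudomolecules):
    """Return a list of human-readable problems; an empty list means no issue found.

    agp_lines       -- iterable of tab-delimited AGP rows
    pseudomolecules -- dict mapping object name to its full sequence
    """
    problems = []
    next_start = {}  # object name -> coordinate the next row should start at
    for lineno, line in enumerate(agp_lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        obj, beg, end, comp_type = cols[0], int(cols[1]), int(cols[2]), cols[4]
        span = end - beg + 1

        # Rows must tile the object: each starts right after the previous one.
        expected = next_start.get(obj, 1)
        if beg != expected:
            problems.append(f"line {lineno}: {obj} starts at {beg}, expected {expected}")
        next_start[obj] = end + 1

        if comp_type in ("N", "U"):
            # Gap row: stated gap length must equal the object span.
            if int(cols[5]) != span:
                problems.append(f"line {lineno}: gap length {cols[5]} != span {span}")
            # The gap region of the sequence should consist only of N.
            region = pseudomolecules.get(obj, "")[beg - 1:end].upper()
            if set(region) - {"N"}:
                problems.append(f"line {lineno}: non-N bases inside gap {beg}-{end}")
        else:
            # Sequence row: component span must equal the object span.
            comp_span = int(cols[7]) - int(cols[6]) + 1
            if comp_span != span:
                problems.append(f"line {lineno}: component span {comp_span} != span {span}")

    # The last AGP coordinate must match the sequence length.
    for obj, nxt in next_start.items():
        length = len(pseudomolecules.get(obj, ""))
        if length != nxt - 1:
            problems.append(f"{obj}: AGP ends at {nxt - 1}, sequence length is {length}")
    return problems
```

A real submission check also covers accession lookups and contamination screening, which need external databases and are outside the scope of this sketch.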

The risk of not submitting to INSDC prior to release is realized when this validation process necessitates changes to the assembly or coordinate system, thus causing discrepancy with the released version. Experience with the submission of RefGen_v2, currently in use throughout the community, is illustrative of this problem. While this submission is still in process, feedback from validation has so far included i) contamination with sequence from non-maize organisms; ii) inappropriate gap placement and length representation; and iii) unacceptable construction of a "chr0" to represent unanchored scaffolds (chr0 needs to be broken up into individual scaffolds). We are fortunate that GenBank is making allowances for RefGen_v2 so as to maintain consistency of annotation coordinates with the public release already in use.
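
To make the chr0 point concrete, here is one possible way to break a concatenated "chr0" object into individual scaffolds at the AGP level. This is only a sketch under a stated assumption: that the boundaries between unanchored scaffolds are marked by gap rows whose linkage column is "no". The message does not say how chr0 was built or how it will be split, and the function name and the output names (scaffold_1, scaffold_2, ...) are hypothetical.

```python
def split_unanchored(agp_lines, source="chr0", prefix="scaffold_"):
    """Rewrite AGP rows so that `source` is split at unlinked gaps.

    Rows for other objects pass through unchanged. Each run of rows between
    unlinked gaps becomes its own object with coordinates restarting at 1.
    """
    out = []
    index = 1      # number of the scaffold currently being written
    offset = None  # amount to subtract from source coordinates
    part = 0       # part number within the current scaffold
    for line in agp_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if cols[0] != source:
            out.append("\t".join(cols))
            continue
        is_gap = cols[4] in ("N", "U")
        if is_gap and cols[7] == "no":
            # Assumption: an unlinked gap separates two independent scaffolds,
            # so drop the gap row and start a new object at the next row.
            if part > 0:
                index += 1
            offset, part = None, 0
            continue
        beg, end = int(cols[1]), int(cols[2])
        if offset is None:
            offset = beg - 1  # re-base so the new scaffold starts at 1
        part += 1
        cols[0] = f"{prefix}{index}"
        cols[1], cols[2], cols[3] = str(beg - offset), str(end - offset), str(part)
        out.append("\t".join(cols))
    return out
```

Splitting this way changes the coordinates of every feature on chr0, which is exactly the kind of discrepancy with an already released version that the paragraph above describes.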

Process and Status:
The flow chart illustrates process and status for submissions. For both annotations and AGP the process is iterative until final acceptance: test submission, feedback, revision, new test submission. While fixes for the v2 issues identified so far have been incorporated into the preliminary AGP of v3, the final approved v2 AGP is critical to make a smooth submission of v3 on the heels of v2. Similarly, final approval of v2 annotation files will be important for the submission of v3 annotations, as the vast majority of genes will only require adjustments of coordinates. Final approval of the 454 whole genome shotgun assembly is also on the critical path for release of RefGen_v3. However, only a relatively small subset of the entire assembly is relevant to the v3 AGP, and NCBI is giving priority to that subset. It has already passed contamination screening, and we do not anticipate any additional issues with what should be a straightforward submission of nucleotide sequences.

For a figure describing the GenBank submission process, history, and status, please visit http://images.maizegdb.org/public/genbank_submission.jpg.

       Carolyn J. Lawrence, Ph.D.
       USDA-ARS Research Geneticist

       carolyn.lawrence from ars.usda.gov

       (515) 294-4294 Office
       (515) 294-5332 Lab

