Important Genome Project Details

David W. Meinke meinke at OSUUNX.UCC.OKSTATE.EDU
Wed Oct 2 11:53:01 EST 1996


Dear Colleagues in the Arabidopsis Community:

On August 20-21, 1996, representatives of research groups
committed to sequencing the Arabidopsis genome met in Arlington,
VA to discuss strategies for facilitating international
cooperation in completing the genome project.  The primary
objectives of this meeting were to establish Arabidopsis as a
model for international coordination of sequencing efforts and to
develop guidelines for rapid and efficient completion of the
sequencing project by the year 2004.  Included at the meeting were
representatives of the following groups currently funded to
participate in large-scale sequencing of the Arabidopsis genome:

EU Consortium of 17 European laboratories (Mike Bevan, PI)

Kazusa DNA Research Institute, Chiba, Japan (Satoshi Tabata, PI)

Cold Spring Harbor / Washington University / Applied 
Biosystems Consortium (Richard McCombie, PI)

Stanford University / U.C. Berkeley-PGEC / Univ. Pennsylvania 
Consortium (Ron Davis, PI)

The Institute for Genomic Research (TIGR)
(Craig Venter, PI)

The US groups represented at this meeting were recently awarded a
total of $12.7 million over 3 years in NSF/DOE/USDA funds to
participate in large-scale sequencing of the Arabidopsis genome. 
Details of this award were given in a press release issued last
week from Washington, D.C.  A copy of that press release has
already been distributed to the community through the Arabidopsis
newsgroup.  

A remarkable degree of consensus was reached by the end of the
August meeting on the general strategy for the Arabidopsis
sequencing project.  All parties agreed to follow several
practices that were seen as facilitating international
cooperation.  A Memorandum of Understanding was drafted to serve
as a modus operandi for the participating groups.  An edited
version of this document is presented below.  The complete
document will be made available to the community once the formal
paperwork has been completed.  As chairman of the Science Steering
Committee for the Multinational Coordinated Arabidopsis Genome
Research Project, I am pleased to report that the sequencing
effort is moving ahead with a level of expertise and spirit of
cooperation that should make Arabidopsis a model for other genome
projects.

David Meinke
Chair, Science Steering Committee
Multinational Arabidopsis Genome Project


Multinational Effort to Sequence the Arabidopsis Genome:


1.  The Arabidopsis Genome Initiative (AGI) is intended to be an
inclusive international collaboration.  Any group that intends to
engage in the sequencing of hundreds of kilobases of contiguous
Arabidopsis genomic DNA will be invited to participate as a
coequal collaborator in the AGI and will be expected to follow the
guidelines outlined in this document.

2.  A coordinating committee with representation from each of the
participating groups was formed.  This committee will be
responsible for making all decisions that affect the overall goals
and operations of the AGI.  In particular, it is anticipated that
the AGI coordinating committee will be a planning and brokering
system for establishing efficient ways of completing the genome. 
The committee will coordinate apportioning regions of the genome
to the various groups in such a way as to minimize needless
duplication of effort while maximizing progress toward complete
sequencing of the genome. The committee will also be responsible
for keeping the Arabidopsis community informed of continuing
advances in the sequencing project.  

Mike Bevan will chair the committee for the first year.  Other
representatives include: Satoshi Tabata (Kazusa DNA Research
Institute), Joe Ecker (SPP consortium), Dick McCombie
(CSH-WU-ABI), Steve Rounsley (TIGR), and David Meinke
(Multinational Arabidopsis Steering Committee). Each member of the
committee will be responsible for arranging a temporary or
permanent replacement from the represented group when appropriate. 
New members will be invited to join the committee based on a
nomination from one member of the committee and an affirmative
vote by a majority.  It is anticipated that the committee will
maintain regular communication and will meet annually.  

3.  Each research group is expected to complete different amounts
of finished sequence because each has different capabilities and
levels of funding devoted to this project.  In order to prevent
duplication of effort, it was considered useful to have the
various groups initiate sequencing in different well-defined
regions of the genome.  It was agreed that each group should begin
by nucleating sites over a contiguous region of a size that could
be completed with the funding available.  It was recognized that
it may not be possible to define such a region with high accuracy
because of variation in the ratio of genetic distance to physical
distance.  The goal in this respect should be to avoid situations
where one group obtains scattered regions of sequence that must
eventually be finished (i.e., linked up) by other groups.  

The SPP group will begin nucleating on chromosome 1. The EU group
will nucleate the bottom arm of chromosome 4. The CSH-WU-ABI
group will nucleate a 4 Mb region on the top arm of chromosome 4
and a 2 Mb region on the top arm of chromosome 5 (the latter in
collaboration with the EU group and the Kazusa group).  The TIGR
group will nucleate chromosome 2.  The Kazusa group will nucleate
the lower part of chromosome 5. The region at the top of
chromosome 5 of mutual interest to the EU, CSH-WU-ABI and Kazusa
group will be sequenced collaboratively.  The Kazusa group expects
to begin nucleating a region of chromosome 3 in 1997.  

The EU, Kazusa, TIGR, SPP and CSH-WU-ABI groups anticipate average
rates of approximately 200, 500, 220, 150 and 150 kb per month,
respectively.  Thus, when all the groups are operating at full
capacity, the combined rate for the entire AGI collaboration is
expected to exceed 1.2 Mb per month.  The philosophy of the AGI
collaboration is that as the initial regions near completion, the
coordinating committee will assign new regions of unfinished
sequence to the groups in proportion to their sequencing
capabilities.  
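
As a rough check on the figures above, the following Python sketch
sums the quoted monthly rates and shows one way a new region could
be split in proportion to capacity.  The function name apportion
and the 1220 kb example region are illustrative assumptions, not
part of the memorandum.

```python
# Anticipated rates in kb per month, as quoted in the text above.
RATES_KB_PER_MONTH = {
    "EU": 200,
    "Kazusa": 500,
    "TIGR": 220,
    "SPP": 150,
    "CSH-WU-ABI": 150,
}

# Combined rate of all five groups at full capacity.
total_kb_per_month = sum(RATES_KB_PER_MONTH.values())
total_mb_per_month = total_kb_per_month / 1000


def apportion(region_kb, rates):
    """Split a region among groups in proportion to their rates.

    Hypothetical helper: the memorandum only states the principle,
    not a formula.
    """
    total = sum(rates.values())
    return {group: region_kb * rate / total for group, rate in rates.items()}


if __name__ == "__main__":
    print(f"Combined rate: {total_mb_per_month} Mb per month")
    for group, share in apportion(1220, RATES_KB_PER_MONTH).items():
        print(f"{group}: {share:.0f} kb")
```

The combined figure is consistent with the statement that the
collaboration expects to exceed 1.2 Mb per month.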

Several of the participants had differing views about the relative
merits of sequencing unique sequences versus regions of repetitive
sequence such as centromeres and telomeres.  On the one hand, it
may be expected that the maximum number of coding sequences will
be found by sequencing the regions of low copy number.  On the
other hand, it will be interesting to know the structure of the
centromeric and telomeric regions.  The majority view appeared to
be that it was not necessary at this time to resolve this issue.  

Sequencing efficiency should be the sole criterion for choosing
which clone to sequence.  It was agreed by all parties that none
of the groups should perform service sequencing for outside groups
interested in particular clones.  The reason for this is that the
sequencing groups should not be seen to be favoring certain
colleagues.

4.  The most efficient strategy for sequencing the Arabidopsis
genome is to shotgun sequence large clones such as BACs, YACs or
inserts from P1 clones.  Most of the groups have had preliminary
experience with BACs and YACs and preferred BACs.  The fact that
most of the groups are currently satisfied with the available
public BAC libraries will facilitate coordination and exchange of
information.  In particular, in order to minimize the requirement
for additional physical mapping, it is desirable to obtain several
hundred base pairs from the ends of a large number of BAC clones 
so that the minimum tiling path from a region of sequence to an
overlapping clone can be determined by database analysis. TIGR
agreed to sequence the ends of BACs from public libraries during
the next two years and to make the information freely available to
the community.
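
The idea of choosing the next clone by database analysis of BAC end
sequences can be sketched in Python as follows.  This is a minimal
illustration only: exact string matching stands in for a real
similarity search such as BLAST, clone orientation is ignored, and
the function name and toy sequences are hypothetical.

```python
def pick_next_clone(finished_seq, bac_ends):
    """Return (clone, overlap_bp) for the best extending clone.

    finished_seq: sequence of the region already completed.
    bac_ends: dict mapping clone name -> one end sequence.

    A clone qualifies if its end sequence occurs in finished_seq.
    Assuming the clone extends to the right of that match, the
    clone whose end lies closest to the right edge overlaps the
    finished region least and so adds the most new sequence.
    Returns None if no end sequence matches.
    """
    best = None
    for clone, end_seq in bac_ends.items():
        pos = finished_seq.rfind(end_seq)
        if pos == -1:
            continue  # this end does not fall in the finished region
        overlap = len(finished_seq) - pos
        if best is None or overlap < best[1]:
            best = (clone, overlap)
    return best


if __name__ == "__main__":
    finished = "AAAACCCCGGGGTTTT"
    ends = {"B1": "CCCC", "B2": "GGGG", "B3": "ACGT"}
    print(pick_next_clone(finished, ends))
```

In practice the end sequences would be several hundred base pairs
long and matched with tolerance for sequencing errors.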

All of the groups will use public BAC, YAC or P1 libraries
constructed from the Columbia ecotype that will be freely
available to the world community.  A suitable BAC library to
begin with is the TAMU BAC library constructed by Choi et al.
(http://probe.nalusda.gov:8000/otherdocs/ww/vol2/choi.html) that
is currently available at the Ohio Stock Center.  The other BAC
library was constructed by Thomas Altmann and collaborators
(altmann at mpimp-golm.mpg.de) and is also publicly available
(http://rldb.rz-berlin.mpg.de).  A P1 library (the 'M library')
developed by Bob Whittier and colleagues at Mitsui is also
available at the Ohio Stock Center and a second library (the 'K
library') is being tested at the Kazusa Institute.

5.  The objective of the AGI is to obtain high accuracy sequence
of the entire genome.  There was general agreement that it was not
possible to set a standard for exactly what high accuracy means or
for mechanisms to enforce high accuracy.  However, it was
generally agreed that a minimal standard would be that >97% of all
sequence would be obtained on both strands or by two chemistries. 
It was the opinion of the group that these criteria were of
similar importance and that with most clones, about seven-fold
redundancy of sequencing would be required for shotgun sequencing.
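
To give a feel for what seven-fold redundancy means in practice,
the Python sketch below estimates the number of shotgun reads for
one clone.  The 100 kb clone size and 500 bp read length are
illustrative assumptions, not values fixed by the memorandum.

```python
import math


def reads_needed(clone_bp, read_bp, redundancy=7):
    """Reads required so that total read length is redundancy x clone."""
    return math.ceil(redundancy * clone_bp / read_bp)


if __name__ == "__main__":
    # Assumed example values: a 100 kb BAC and 500 bp reads.
    print(reads_needed(clone_bp=100_000, read_bp=500))
```

This counts raw reads only and ignores failed reactions and the
extra directed reads usually needed to close gaps.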

An unknown factor affecting the accuracy of the sequence concerns
the fidelity of the BAC clones.  Preliminary experience suggests
that the BACs are generally faithful clones of the genome. 
However, it will be essential to verify the integrity of each BAC. 
A minimum criterion is that both ends of the BAC should map to the
same region of the genome, typically to the same YAC.  When 14,000
BAC ends are sequenced, it is expected that we will have, on
average, 500 bp of sequence every 5 kb throughout the genome.  The
resulting library of end-sequenced BACs will represent a check on
BAC integrity that will assist in revealing any major
rearrangements, deletions or additions.  No standard was agreed
upon for BAC (or P1) integrity checking.  However, most groups
indicated that comparing fingerprints of tiled BACs would be the
most appropriate criterion for integrity.
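
The end-sequence figures and the minimum integrity criterion above
can be illustrated with a short Python sketch.  The span computed
here is only what the quoted numbers imply, not a genome size
stated in this document, and the ends_agree helper and its input
format are hypothetical.

```python
BAC_ENDS = 14_000      # number of BAC ends to be sequenced
BP_PER_END = 500       # average sequence obtained per end
SPACING_BP = 5_000     # one end sequence expected every 5 kb

total_end_bp = BAC_ENDS * BP_PER_END       # total end sequence
implied_span_bp = BAC_ENDS * SPACING_BP    # span implied by the spacing
fraction_sampled = BP_PER_END / SPACING_BP # share of that span sampled


def ends_agree(end_hits, clone):
    """Minimum criterion: both ends of a BAC map to the same YAC.

    end_hits maps (clone, "left") and (clone, "right") to the name
    of the YAC each end maps to, or None if the end is unmapped.
    """
    left = end_hits.get((clone, "left"))
    right = end_hits.get((clone, "right"))
    return left is not None and left == right


if __name__ == "__main__":
    print(total_end_bp, implied_span_bp, fraction_sampled)
    hits = {
        ("B1", "left"): "YAC7", ("B1", "right"): "YAC7",
        ("B2", "left"): "YAC7", ("B2", "right"): "YAC9",
    }
    print(ends_agree(hits, "B1"), ends_agree(hits, "B2"))
```

A clone that fails this test is not necessarily rearranged, since
its ends may simply fall on two adjacent YACs, so a real check
would need the YAC map as well.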

After some discussion, it was agreed that a large-scale effort
toward single-pass shotgun sequencing of the entire genome would
not be worthwhile because the combination of available ESTs and
the high output rate of the AGI collaboration would obviate much
of the value of single-pass shotgun sequencing for gene discovery. 
However, it was also noted that the existing BAC libraries may not
provide complete coverage of the genome and/or may contain small
rearrangements or mutations; a shotgun library of the whole genome
might provide clones to fill gaps and to verify the integrity of
the BAC clones.  The value of this approach will be reassessed by
the coordinating committee as the project proceeds. 

6.  All of the participating laboratories are committed to early
data release via the internet.  One approach discussed at the
meeting involved daily release of preliminary sequence information
(i.e., sequences that have been edited to remove vector and regions
of high ambiguity and condensed into >1 kb contigs).  The C.
elegans sequencing groups follow this approach and the community
has found it very useful.  Two of the US groups, the SPP
consortium and the CSH-WU-ABI consortium, intend to release data in
this way.  Both groups anticipate release of finished, annotated
sequence within six months of beginning to sequence a clone. The
EU group does not consider it feasible, at the moment, to do daily
releases because the consortium is composed of seventeen
relatively small sequencing groups with varying levels of
technical capabilities.  The EU anticipates release of finished
annotated sequence within one month of completion. The TIGR and
Kazusa groups do not wish to release unfinished sequence because
they believe that carefully edited sequence will be most useful to
the community.  Both groups promised release of information on a
given clone to public databases within three to six months after
sequencing began.  The TIGR group will release finished, annotated
sequence within three months of beginning to sequence a BAC.  The
Kazusa group estimates that they will release finished, annotated
sequence within four to six months of beginning to sequence a
clone.  In all cases, the start date for sequencing a specific
clone will be announced on linked WWW sites so that members of the
community will know when to expect the finished sequence.

In summary, all of the groups agreed to establish linked WWW pages
for posting complete lists of all clones that have been sequenced
to date, along with the start dates of those clones that are still
in progress, and the anticipated start dates for the next set of
clones to be sequenced in the future.  Each clone will therefore
have a start date that will be widely advertised to the community. 
All of the groups anticipate that it will take less than six
months to completely sequence and annotate a BAC, YAC or P1 clone
and that they will deposit the complete annotated sequences in a
public database (e.g., GenBank, EMBL, JDB).  No sequence
information will be withheld from the community for the sole
purpose of benefiting selected individuals, groups, or private
companies.
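
Because each clone will have a publicly announced start date, the
latest expected release date for finished sequence follows directly
from the commitments in this section.  The Python sketch below uses
the upper bound stated for each group.  The EU consortium is
omitted because its commitment is measured from completion rather
than from the start date.  The function names are hypothetical.

```python
import calendar
from datetime import date

# Upper bounds in months from the announced start date, per the text.
MONTHS_TO_FINISHED = {
    "SPP": 6,
    "CSH-WU-ABI": 6,
    "TIGR": 3,
    "Kazusa": 6,  # stated as four to six months
}


def add_months(start, months):
    """Return the date `months` calendar months after `start`.

    The day is clamped to the last day of the target month.
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def latest_release(group, start):
    """Latest date by which finished sequence is expected."""
    return add_months(start, MONTHS_TO_FINISHED[group])


if __name__ == "__main__":
    print(latest_release("TIGR", date(1996, 10, 2)))
```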

7.  There was consensus that the value of the sequence obtained is
proportional to the quality of annotation.  Thus, each group will
attempt to achieve a common standard of annotation.  Each group
will perform BLAST (or FASTA) searches to align ESTs and known
genes and gene products to the genomic sequence.  In addition,
each group will use programs such as GRAIL and GeneFinder to
identify ORFs.  Annotation should be presented to the community in
a format that can be readily accessed and understood by plant
biologists worldwide.  It was agreed that all unassigned ORFs
would be named according to the C. elegans system.  Details of the
nomenclature system will be included in the final version of this
document. 

It was recognized that annotation of a clone at the time of
deposit in public databases will rapidly be rendered obsolete
because of information about genes being discovered by the
community at large.  Thus, there will be an ongoing need for
annotation of previously sequenced clones.  Because most of the
groups are funded to produce new sequence, it will be difficult
for the groups producing sequence to also take responsibility for
revising the annotation of previously completed sequence.  There
was broad agreement that the task of annotation revision should be
institutionalized by assigning responsibility for revision to the
curators of the Arabidopsis database (AtDB).  The group expressed
its strong enthusiasm and support for the continued funding of
AtDB to make certain that essential informatics components of the
Arabidopsis genome project are not overlooked. Mike Cherry agreed
that it was a suitable responsibility for AtDB and agreed to
accept the task to the extent that resources permit.

8.  It is considered essential to keep the entire community well
informed of technical advances and practical applications of the
genome project.  Each group will mount a WWW page that will report
the contribution of the group to the multinational sequencing
effort.  Each group will also work through Mike Cherry (AtDB) and
the coordinating committee to make certain that community members
receive the training required to make efficient use of the
extensive sequence data that will be generated over the next
several years.  In addition, the coordinating committee will
explore other ways of documenting advances in the Arabidopsis
Genome Initiative to a wide audience.  These efforts should help
to advertise the dramatic impact that sequencing the Arabidopsis
genome will have on basic and applied research in plant biology.



-----
David W. Meinke
Department of Botany
Oklahoma State University
Stillwater, OK  74078 
Phone: 405-744-6549
FAX:   405-744-7673
Email: meinke at osuunx.ucc.okstate.edu
WWW:   http://mutant.lse.okstate.edu/



