Arabidopsis Database 3 of 4 parts

Chris.Somerville 21847CRS at MSU.EDU
Sun Aug 8 23:03:00 EST 1993


II.  How will the database be used? What links should be made
between categories of information?
      In addition to the specific ability to perform searches as
described in the previous sections, the categories of information
must be linked with user-friendly interfaces.  Attention should be
paid to tight coordination between the genetic map and related
genes, clones, and sequences, so that selection of any of these
will lead transparently to the others.  It is also highly
desirable for the database to have simple links for comparative
sequence and mutant analysis with other plants and, beyond that,
with all organisms.  The interface should allow the data to be
viewed in a variety of ways.
      As examples of the types of links desired, a series of
questions that the system should be able to answer is listed
below.  Most of these examples were suggested by current users of
AAtDB and AIMS, the information management database associated with
the Arabidopsis Biological Resource Center at Ohio State
University.
      - If a user enters two cloned markers, the system should
return a list of all markers of a specified type that map between
them.
      - If a user points to a location on a genetic map, the genes,
clones, and sequences should appear.  Likewise, a user should be
able to derive map position if a DNA sequence is used as the
starting point.
      - If a user finds a specific clone, the expression pattern of
the RNA encoded by the clone should be readily accessed.
      - If a user finds a mutant that is altered in a particular
way, the system should retrieve all mutants altered in a similar
manner.  Cross-species access to similar mutants in other plants
might also be useful.
      - If a user desires to see the map positions for all genes in
a given biochemical or developmental pathway, she should be able to
do so.
      - If a user has new mapping information, the system should
have the ability to download archived data in that region for
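
      The first two questions in the list above can be sketched in a
few lines of code.  This is only an illustration under stated
assumptions, not a proposed design: the Marker record, its field
names, and the example markers and map positions are all
hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Marker:
    """Hypothetical record for anything placed on the genetic map."""
    name: str
    kind: str           # e.g. "RFLP", "gene", "clone"
    chromosome: int
    position_cm: float  # map position in centimorgans


def markers_between(markers, first, second, kind=None):
    """Return markers that map strictly between two named markers."""
    by_name = {m.name: m for m in markers}
    a, b = by_name[first], by_name[second]
    if a.chromosome != b.chromosome:
        raise ValueError("markers are on different chromosomes")
    lo, hi = sorted((a.position_cm, b.position_cm))
    hits = [
        m for m in markers
        if m.chromosome == a.chromosome
        and lo < m.position_cm < hi
        and (kind is None or m.kind == kind)
    ]
    return sorted(hits, key=lambda m: m.position_cm)


def features_near(markers, chromosome, position_cm, window_cm=2.0):
    """Return everything mapped within a window around a map location."""
    hits = [
        m for m in markers
        if m.chromosome == chromosome
        and abs(m.position_cm - position_cm) <= window_cm
    ]
    return sorted(hits, key=lambda m: m.position_cm)


# Invented example data; names and positions are placeholders only.
EXAMPLE_MARKERS = [
    Marker("mA", "RFLP", 1, 10.0),
    Marker("geneX", "gene", 1, 15.5),
    Marker("mB", "RFLP", 1, 22.0),
    Marker("mC", "RFLP", 1, 30.0),
    Marker("mD", "RFLP", 2, 12.0),
]

between = markers_between(EXAMPLE_MARKERS, "mA", "mC", kind="RFLP")
nearby = features_near(EXAMPLE_MARKERS, chromosome=1, position_cm=15.0)
```

Here `between` holds only the RFLP marker lying between mA and mC,
and `nearby` holds whatever maps within two centimorgans of the
chosen location.  A real system would answer the same questions from
the database rather than from an in-memory list.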

III.  What community issues must be considered in the design and
operation of the database?

A.  Advisory Committee
      An Arabidopsis database (ADB) proposal should include a
provision for an advisory committee that will represent the
community of Arabidopsis researchers and will advise ADB
investigators on priorities and data to be included.
      One model of how an advisory committee would function is based
on an analogy between ADB and a scientific journal.  In this model,
the ADB investigators, who are funded by an ADB grant and are
responsible for database assembly, would be analogous to the
publisher of the journal.  The ADB curator(s) (a permanent
professional position) would be analogous to the managing editor of
the journal.  The ADB advisory committee would be analogous to the
editor-in-chief plus the senior editors of the journal.  Finally,
the ADB advisory committee would appoint an editorial board which
would evaluate specific submissions to ADB in the same way that the
editorial board of a journal reviews submitted manuscripts for
scientific content and appropriateness.
      The Chairperson of the ADB advisory committee could be
appointed by the PI of the ADB grant in consultation with the
granting agency and the North American Arabidopsis Steering
Committee (NAASC).  The advisory committee could also include four
additional members who would be appointed to four-year terms by
NAASC, which should take care to include an international
representative.  Initially, two of these committee members would
have two-year terms instead of four-year terms to enable two
members to be replaced every two years.
      The ADB Advisory Committee should provide an annual written
report that would assess the progress of the database and make
recommendations for the coming year.  This report would be a part
of the annual progress report of the ADB grant to the funding
agency.
      One role of the advisory committee will be to work with the
ADB curator to establish a standardized Arabidopsis nomenclature
for cloned genes that is consistent with other genome databases.

B.  Curation, Entry, Correction, and Long-Term Storage of Data
      Curation: Because ADB will be relatively small compared to the
human genome database and will most likely have limited funds,
developers of ADB should make every effort to leverage the database
activities undertaken elsewhere and adapt existing software, when
appropriate, for use in the Arabidopsis research community.  Thus,
the major activity of ADB will be the collection, entry, and
correction of data rather than the writing of software for storage,
retrieval, and presentation of data.  Therefore, one or more
full-time professional curators will be key to the efficient
operation of ADB.
      Ideally, the curator would be an Arabidopsis biologist with
extensive computing experience.  Computing experience is important
because key features of the data in ADB will be compatibility with
the data in other genome databases and portability to future
generation Arabidopsis databases. Experience as an Arabidopsis
biologist will be extremely helpful in devising ways of collecting,
entering, storing, retrieving, and presenting data that Arabidopsis
biologists would find useful.
      The ADB curator will work closely with the ADB advisory
committee in determining the categories of data stored in ADB and
in designing the structure of the data storage system.  It is
important that this design be forward-looking, anticipating the
increasing complexity of genomic data that will surely arise as
more regions of the genome become well known.
      The ADB curator will also interact closely with the ADB
editorial board as well as with individual contributors, in the way
that the managing editor of a journal communicates with both
authors and referees.
      Data entry and correction: It should be assumed that much
responsibility for entry or submission of data to the database will
rest with the research community.  The database proposal should
consider how users will enter data simply and efficiently.  Data
submission tools should be easy to use and carefully thought through.
There should be a robust procedure by which researchers will be
able to oversee the quality of data and make corrections.
Corrections should be entered by the curator, and should not be
simple over-writes of existing information.  Instead,
author-supplied corrections should be new entries, providing
updated information.  In this way, author updates can be edited and
removed, if desired.
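
      The idea that corrections are new entries rather than
over-writes can be sketched as follows.  This is a minimal
illustration, assuming an in-memory store; the class names, the
"ADB000001" accession format, and the example record are
hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    accession: str
    data: dict
    supersedes: Optional[str] = None  # accession of the entry corrected
    retracted: bool = False


class Store:
    """Append-only store: a correction never alters an older entry."""

    def __init__(self):
        self._entries = {}
        self._counter = 0

    def add(self, data, supersedes=None):
        """Add a new entry, optionally as a correction to an older one."""
        if supersedes is not None and supersedes not in self._entries:
            raise KeyError(supersedes)
        self._counter += 1
        accession = f"ADB{self._counter:06d}"  # assumed format
        self._entries[accession] = Entry(accession, dict(data), supersedes)
        return accession

    def get(self, accession):
        """Return the entry exactly as it was submitted."""
        return self._entries[accession]

    def retract(self, accession):
        """Withdraw a correction without touching the entry it corrected."""
        self._entries[accession].retracted = True

    def current(self, accession):
        """Follow the chain of corrections to the newest valid version."""
        latest = self._entries[accession]
        while True:
            newer = [
                e for e in self._entries.values()
                if e.supersedes == latest.accession and not e.retracted
            ]
            if not newer:
                return latest
            latest = newer[-1]  # most recently added correction


store = Store()
original = store.add({"locus": "geneX", "position_cm": 10.0})
fix = store.add({"locus": "geneX", "position_cm": 12.0}, supersedes=original)
```

After these calls, `store.current(original)` returns the corrected
record while `store.get(original)` still returns the record as first
submitted.  Calling `store.retract(fix)` removes the update and makes
the original current again.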
      The ADB curator will be responsible for designing
user-friendly data entry software that assigns a unique accession
number to each entry.  A great deal of thought must be given to the
mechanism by which data is refereed before permanent entry into
ADB.  Some data such as DNA sequences may require relatively
little, if any, refereeing.  A major decision to be made is whether
any data will be entered permanently by members of the community
without the intervention of the curator.
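
      One way to picture the refereeing decision is as a status
attached to each submission.  The sketch below is illustrative only;
the status names and the choice of which data types skip review are
assumptions that the advisory committee would have to settle.

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"    # waiting for the curator or editorial board
    ACCEPTED = "accepted"  # permanently entered
    REJECTED = "rejected"


# Assumption: a policy table of data types entered without refereeing.
NO_REVIEW_TYPES = {"dna_sequence"}


def initial_status(data_type):
    """Decide whether a new submission needs refereeing."""
    if data_type in NO_REVIEW_TYPES:
        return Status.ACCEPTED
    return Status.PENDING


def review(status, approved):
    """Apply a referee decision to a pending submission."""
    if status is not Status.PENDING:
        raise ValueError("only pending submissions can be reviewed")
    return Status.ACCEPTED if approved else Status.REJECTED
```

Under this sketch, allowing the community to enter a given type of
data without curator intervention amounts to adding that type to the
policy table.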
      Some data such as references to published work can be entered
directly by the curator.  Other data generated in individual
laboratories can be entered directly by members of the community.
In other cases, however, the curator will have to work closely with
particular members of the advisory committee, the editorial board,
or the Arabidopsis community to collect, collate, and present
certain types of data, such as two- and three-factor mapping data
used in the construction of genetic maps.
      Individual researchers should not be allowed to make
corrections or alterations in existing data.  Rather, corrections
or additions to existing data should be entered by the curator in
consultation with the advisory committee or editorial board if
necessary.
      Special attention should be paid to sites entering major
amounts of data: they should bear much of the responsibility for
data entry in their own areas, perhaps by training a person at that
site to be an "assistant curator." There must be explicit
consideration of how the curator will collect and enter data from
major data producers.
      The issue of linking formal publication of information in
peer-reviewed journals to submission of the underlying data in ADB
should be addressed.  The requirement by most journals that
descriptions of sequences be accompanied by accession numbers of
the sequences in appropriate databases has had a positive effect on
making sequence data easily available to the community.  Some
similar system for linking published papers with material in the
ADB should be considered.
      Long-term storage: Databases are like libraries or journals.
Once they are created, the need to access them will continue
indefinitely.  Long-term funding is required to keep information
current and to gather information which is not already in
electronic form.

C.  Relation to other Databases and Programs
      When possible, ADB should use industry-standard hardware and
software, so that ADB is compatible with, and can communicate
transparently with, other databases.  However, as stated elsewhere
in this report, the primary goal of ADB should be to collect and
store data using currently accepted database models rather than to
develop new database software specifically for ADB.  The most
important principle, therefore, in the design of ADB is that the
data be entered in a form that makes it possible to interface
easily with other databases and which makes the data in ADB
portable to future generation database software.  Any software that
is written specifically for ADB (display of genetic maps, for
example) should be layered and use industry standard interfaces so
that the software, as well as the underlying data, is also
compatible with and portable to future generation databases.  The
general principle is that it only makes sense to spend money on the
development of generic databases that can be used for a variety of
different genomes.
      An ADB proposal should discuss the software used in other
genome databases and how ADB will relate to that software.  Since
funds for developing an Arabidopsis database from scratch are not
available, programs and methods from other databases must be
considered for use by the Arabidopsis proposal.
      The data from the Arabidopsis database must be accessible by
analytical programs for DNA sequence analysis, genetic mapping,
etc.  Therefore, an application programming interface should be
provided.
      ADB should interface transparently with other genome databases
(or other ADB databases, if there is more than one) such that a
user can ask a question of, or retrieve data from, more than one
database simultaneously.
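
      Asking one question of several databases at once might look
like the sketch below.  It assumes each database can be wrapped in a
function that takes a query string and returns a list of records;
the in-memory backends, database names, and records are placeholders
rather than real interfaces.

```python
from concurrent.futures import ThreadPoolExecutor


def make_backend(records):
    """Stand-in for a real database: substring match on record names."""
    def search(query):
        q = query.lower()
        return [r for r in records if q in r["name"].lower()]
    return search


def federated_search(query, backends):
    """Send one query to every backend and label each hit by source."""
    with ThreadPoolExecutor() as pool:
        # Submit all searches first so they run concurrently.
        futures = {
            name: pool.submit(search, query)
            for name, search in backends.items()
        }
        return [
            {"source": name, **record}
            for name, future in futures.items()
            for record in future.result()
        ]


# Placeholder databases and records, invented for illustration.
backends = {
    "ADB": make_backend([{"name": "geneX"}]),
    "other_genome_db": make_backend(
        [{"name": "geneX homolog"}, {"name": "geneY"}]
    ),
}
hits = federated_search("genex", backends)
```

Each hit records the database it came from, so the user sees a
single answer while the origin of every record stays visible.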

D.  Availability of the database.
      Data accumulated by a publicly funded database should be
community property. There should be no restrictions on the
availability of the data in ADB.  ADB must be available
