Arabidopsis Database 4 of 4 parts

Chris.Somerville 21847CRS at MSU.EDU
Sun Aug 8 23:03:00 EST 1993


PART 3 FOLLOWS

IV.  What design-feature issues need to be considered?
      As an aid to those who plan to submit proposals for
Arabidopsis database services, the committee discussed in general
terms some of the design features that would allow the ADB to serve
the community with maximal efficiency, and recommended that any
proposal for database services include a discussion of these design
considerations.  In addition, the committee recommended that any
Arabidopsis database proposal consist of two parts.  One should be
for biologists, describing what the database could do, with
examples like those listed above.  The other should be for database
and computing experts, showing how the goals will be achieved at a
technical level, and how the methods proposed relate to existing
technical methods in use in genome databases.

A.  Design considerations that should be discussed in the proposal:

      1.  Controlled vocabularies.  The use of plain text comment
fields and other fields with uncontrolled vocabularies must be kept
to a minimum.  Any field upon which a user might be expected to
initiate a search should contain controlled-vocabulary entries.
It will not do if users must, for example, know every synonym for
a particular phenotype in order to obtain information about that
phenotype.
      2.  System portability.  Data must always be in a form that
will be portable to new database systems, and to new computers and
computer types.  An explicit description of how this will be
achieved should be included in the proposal.  Industry standard
methods for data storage and transport should be discussed in
relation to the methods proposed for the Arabidopsis database, with
the aim of being able to transport data and functionality to new
database systems.  Current industry trends are towards layered
software systems and client-server databases, and we expect
submitters to discuss such trends explicitly in the proposal.
      3.  Bulk data transfers.  All data should be available for
bulk access or bulk downloading, as discussed above.  The methods
by which this will be assured should be described.
      4.  Data time stamping.  The database should record when its
own contents were last updated: date/time stamps should be
considered for all data and for links between data, so that users
can retrieve updates on a regular basis.
      5.  Cross references.  The database should contain
cross-references to the following databases (where available):

      - GenBank Nucleotide Sequence Database
      - Arabidopsis thaliana stock center database
      - Cell and/or probe repository catalog number(s)
      - Genetic map databases for species showing significant
      synteny with Arabidopsis thaliana

      6.  Database structure.  To ensure the development of a robust
and stable production quality system, the database should be based
upon readily available, proven software.
      7.  Database interoperability.  To facilitate inter-database
linking, and referencing of items in the database in an unambiguous
manner for publications and other reports, and to ensure the
long-term ready availability of the data in the database, primary
entities ("unit records") should be identified by public,
unchanging unique identifiers (accession numbers).
      8.  Data dictionary.  The database structure should be
described in a data dictionary or repository which would be
available to database users.  In addition to documenting the
database schema and providing prose descriptions of the database
tables and data elements (fields), the data dictionary should
document integrity constraints implemented in the database (or in
layered software) and the rationale for the database design.
Copies of the data dictionary should be available to users at a
nominal cost.
      9.  Software.  All software developed under this contract
should be designed and implemented for maximal portability
consistent with timely and cost-effective delivery of service.
      10.  Computing facilities.  To facilitate user and developer
access, the database should be maintained on a computer (or network
of computers) which uses the Unix operating system and which is
connected to the U.S. Research Internet by communications lines
which operate at a minimum of 1.54 Mbits/sec.
      11.  Performance.  The database should respond to user queries
in a reasonable time, and the database developers should regularly
monitor the system efficiency and response times and should take
corrective steps whenever response degradation becomes significant.
      12.  Documentation.  The developer should maintain complete
documentation and source code for all software developed in this
project and complete documentation for all database design and
implementation.  This information should be made available to the
funding agencies and to others to ensure the long term
functionality of the system and to ensure long term access by the
scientific community, even beyond the termination of the
developer's involvement in the project.
      13.  Security.  The database will represent an irreplaceable
resource for the community.  Therefore, the developer must take
care to ensure that the data and programs are protected against
loss.
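
As an illustration only, items 1, 4 and 7 above (controlled
vocabularies, time stamping, and unchanging accession numbers) can
be sketched together in a few lines of Python.  Everything named
here is hypothetical and not part of the committee's
recommendations: the class MiniADB, the "ADB000001" accession
format, and the small synonym table are assumptions made for the
example, and a real system would sit on proven database software as
item 6 requires.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from itertools import count

# Hypothetical controlled vocabulary: every accepted synonym maps to
# exactly one canonical phenotype term.
PHENOTYPE_SYNONYMS = {
    "late-flowering": "late-flowering",
    "late flowering": "late-flowering",
    "delayed flowering": "late-flowering",
    "dwarf": "dwarf",
    "reduced stature": "dwarf",
}


def canonical_phenotype(term: str) -> str:
    """Map a user-supplied term to its controlled-vocabulary entry."""
    key = " ".join(term.lower().split())  # normalise case and spacing
    if key not in PHENOTYPE_SYNONYMS:
        raise ValueError(f"{term!r} is not in the controlled vocabulary")
    return PHENOTYPE_SYNONYMS[key]


@dataclass
class UnitRecord:
    accession: str     # public identifier, assigned once, never reused
    phenotype: str     # always stored as the canonical term
    updated: datetime  # time stamp of the last change


class MiniADB:
    """Toy in-memory store; stands in for a real database back end."""

    def __init__(self):
        self._serial = count(1)
        self._records = {}

    def add(self, phenotype, when=None):
        if when is None:
            when = datetime.now(timezone.utc)
        accession = f"ADB{next(self._serial):06d}"  # assumed format
        self._records[accession] = UnitRecord(
            accession, canonical_phenotype(phenotype), when
        )
        return accession

    def search(self, phenotype):
        """Find records by any synonym of a phenotype."""
        term = canonical_phenotype(phenotype)
        return sorted(
            a for a, r in self._records.items() if r.phenotype == term
        )

    def updated_since(self, cutoff):
        """Accessions changed after `cutoff`, for periodic updates."""
        return sorted(
            a for a, r in self._records.items() if r.updated > cutoff
        )
```

With this arrangement a user who searches for "delayed flowering"
finds records entered as "late flowering", and a remote site can
ask only for records changed since its last download.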

B.  Short-term research goals
      The developer should consider and propose to carry out some
short-term research relevant to improving the quality of the
Arabidopsis thaliana database.  Some possibilities for short-term
research would be:

      1.  Add capability of representing and storing data for other
plant species.
      2.  Explore the abstract and generic nature of mapping data
and develop generic representation systems in anticipation of
adding mapping data generated by as yet undiscovered mapping
techniques.
      3.  Develop methods for defining and controlling differential
access to the data.
      4.  Develop a means for providing an audit trail, or other
historical record, of all changes to the database.
      5.  Investigate methods for facilitating interdatabase
interactions and connections.
      6.  Develop a stable, documented application program interface
(API) to the database.
      7.  Develop a method for representing variations in data
quality and for recording uncertainty.
      8.  Develop means for integration of physical mapping data
with genetic and cytogenetic maps.
      9.  Develop means for providing ready user access to
underlying supporting data (maintained in remote laboratory
databases) through the database on-line user interface.
      10.  Develop improvements in data presentation, including
graphical representation of maps.
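
As a rough sketch of what goals 4 and 7 might look like in
practice, the Python fragment below keeps an append-only log of
every change and stores a map position together with its
uncertainty.  The names AuditedStore, Change and MapPosition, and
the choice of a standard error in centimorgans as the uncertainty
measure, are assumptions made for this example rather than anything
specified by the committee.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class MapPosition:
    """A value stored with an explicit statement of its uncertainty."""
    centimorgans: float
    std_error: float  # assumed uncertainty measure for this sketch


@dataclass(frozen=True)
class Change:
    """One audit-trail entry; entries are never edited or deleted."""
    when: datetime
    who: str
    accession: str
    attribute: str
    old: Any
    new: Any


class AuditedStore:
    def __init__(self):
        self._data = {}  # accession -> {attribute: value}
        self._log = []   # append-only list of Change entries

    def set(self, accession, attribute, value, who, when=None):
        if when is None:
            when = datetime.now(timezone.utc)
        record = self._data.setdefault(accession, {})
        old = record.get(attribute)
        if old == value:
            return  # nothing changed, so nothing is logged
        record[attribute] = value
        self._log.append(
            Change(when, who, accession, attribute, old, value)
        )

    def get(self, accession, attribute):
        return self._data[accession][attribute]

    def history(self, accession):
        """All recorded changes to one unit record, oldest first."""
        return [c for c in self._log if c.accession == accession]
```

Because each log entry carries both the old and the new value, the
state of any record at an earlier date can in principle be
reconstructed from the log alone.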

C.  Possible long term research goals

      1.  Investigate new database systems and new data models.
      2.  Monitor advances in hardware and develop plans for using
new hardware to improve the quality of the database.

      All research projects should include specific plans for
production of prototype systems and for their acceptance testing by
the appropriate user communities.

Participants
      The Arabidopsis Informatics Needs Assessment Workshop, June
5 and 6, 1993, Dallas, Texas, was attended by the North American
Arabidopsis Steering Committee, the elected representatives of
Arabidopsis researchers:

  Elliot Meyerowitz, California Institute of Technology (Chair)
  Fred Ausubel, Massachusetts General Hospital and Harvard School
    of Medicine
  Joanne Chory, Salk Institute
  Joseph Ecker, University of Pennsylvania
  David Meinke, Oklahoma State University
  Chris Somerville, Michigan State University

and by:

  Machi Dilworth, National Science Foundation
  Steven Heller, U.S. Department of Agriculture
  A. Vassarotti, Commission of the European Communities
  Ken Fasman, Genome Database, Johns Hopkins University
  Robert Robbins, Laboratory for Applied R

