GenBank Errors

Mon Oct 21 13:43:00 EST 1991

Rather than chastising Tom Schneider for attempting to raise (perhaps not in
the most diplomatic fashion) critical issues concerning the databases, it is
time to discuss the environment that has led to these conditions.

Over the past ten years or so the US funding agencies have operated on the
assumption that the vast majority of research scientists wish biological data
to be `freely available' and in the `public domain.' This policy has had a
fundamental effect on how the database projects (particularly the
macromolecular sequence databases) have been administered and conducted. What
seems not always to be understood is that these policies have not altered the
basic fact that databases must be paid for. Rather than being assessed directly
for database services as is typical for other essential research materials such
as journal subscriptions, reagents, equipment, etc., investigators are assessed
indirectly by having the general pool of money available for research grants
reduced accordingly.

If one were cynical, one might suggest that all that has really been
accomplished is that the investigators have given up their `right of choice.'
In most other cases, the quality, type, and nature of materials available are
regulated by the laws of an open market. Investigators are free to choose those
products that they deem to be most suitable for their needs and the level of
effort expended in developing these products is controlled by the demand for
them. Because there are no direct feedback mechanisms for products that are
largely subsidized and for which the government has a near monopoly, the only
choice left to the investigator is to use the service or not; there are no
alternatives. This `all or nothing' approach is at the root of many of the
problems that Tom and others have been raising over the course of the past ten
years. The databases have been put in the position of being everything to
everybody and we are all too familiar with the foibles associated with such

Of course, such an interpretation is overly simplistic. It does not do justice
to the many sound reasons for adopting a `public domain' policy for biological
information and it does not address the fact that the scientific community has
a profound influence over the activities of the funding agencies. The policy
makers and those researchers who serve on the myriad of panels and committees
that allocate biomedical resources and administer these projects are our peers
and our colleagues. They have made a concerted effort to solicit our opinions
and concerns and to reflect the interests of the general scientific community
that they have been called upon to represent.

If there is any blame to assess, it is ours for not taking the responsibility
for expressing our needs and concerns in a constructive manner. As Jim Cassett
and others have attempted to point out, it is neither justified nor productive
to vent our frustrations on a small group of database workers who are simply
attempting to carry out our wishes within the limits of the resources that we
have provided for them. If we accept the model that databases should be `public
domain' and, therefore, directly subsidized, then it is our responsibility to
ensure that the lack of direct feedback mechanisms for regulating such
projects does not adversely affect them. It is up to us to ensure that
mechanisms are put in place to provide adequate levels of funding for such
projects and to clearly define their missions.

The GenBank(R) project has made great strides in eliminating the backlog of
sequence data and in providing a timely and up-to-date data collection. The
National Center for Biotechnology Information (NCBI) will begin administering
this project in 1992 and, in addition, is initiating its own `backbone'
database to supplement these efforts. These central repositories will provide a
set of data that directly reflects the published literature and the
experimental sequencing results as submitted to the database centers. The
question that Tom Schneider has raised is not whether these efforts are
worthwhile, but whether they will be sufficient to meet all the needs of the
scientific research community.

The NCBI has recognized the difficulties in the `all or nothing' approach and
is fostering the development of specialized databases that are expected to
refine, analyse, organize, and reformulate the information presented in the
`backbone' in a manner suitable to address individualized needs. The critical
issue is whether the policies that are now being formulated will ensure that
sufficient resources can be made available to allow large scale specialized
database projects to be established and maintained and that these efforts
address realistic needs. Come 1995, if the databases are not adequate, we will
be the only ones to blame. So I suggest that we take an active interest in
these issues today. Perhaps the bulletin board is a good place to discuss these
needs and to evaluate the policies that are now being put in place to meet

This note reflects my own opinions. I do not represent the NCBI, and I have
never served on any advisory panels for GenBank, NCBI, etc.  If I have
inadvertently misrepresented anyone else's position, please feel free to
correct me.

                                        David G. George
                                        Protein Identification Resource (PIR)

