GenBank errors

Paul Gilna pgil at HISTONE.LANL.GOV
Mon Oct 21 12:59:52 EST 1991

	Let me try and place some (hopefully calming) perspective on
	this business, which as Sanjay correctly says is about the
	right issue, but on the wrong plane.

	The issue is not about the fact that "errors", as variously
	defined along the way to be ours, yours and everyone else's,
	exist in today's databases, whether as a result of past
	travesties or present inadequacies; heck, we knew that! The
	issue is what we are going to DO about it.

	Let me try, as accurately and as unemotionally as I can, to
	describe what our approach is to this problem.

	Firstly, we have a job to do: acquire (some say extort!) the
	data from you and return it to you in as rapid and as accurate
	a manner as possible.

	The rate of growth of these data is, as Tom rightly bemoaned,
	exponential; GenBank's doubling time is creeping below the one
	year mark, and there are times here where we put out in one
	week what it took us one year to accomplish not too long ago.

	That we are managing to do this is no accident--our philosophy
	has been that in order to deal with a volume of data that doubles
	every year, yet maintain roughly steady-state staffing levels,
	we must double our efficiency every year just to keep pace.

	The only other alternative is to allow our budget to follow the
	same growth curve! But we long ago realised that throwing
	people at the problem was not the answer.

	There are many paths we chose to achieve this, but there are
	two I will mention because they have a bearing on this issue.
	One was conferral of the responsibility for "data entry" to
	the community; the other was automation of that transfer.

	Most of you reading this know by now that we receive 80-90% of
	our data in electronically submitted form; what you don't yet
	know is that about 50% comes in automated, transaction form--in
	the space of one year, the "market share" of Authorin
	submissions has made dramatic leaps, such that about 50% of the
	data we receive enter the database completely
	automatically--leaving us with only the process of data

	These two achievements have relevance in the following manner:

	In the past GenBank suffered because it believed it "owned" the
	data, and therefore expended precious time "curating" those
	data.  Today, we must try to shift that onus to the author,
	because ultimately THAT is the level at which error detection
	and correction will have to happen. The simple fact remains
	that given the volume, GenBank cannot expect to "curate" the
	entire database in order to keep it accurate, and it is neither
	money nor time that keeps us from this; rather, we believe that
	is the responsibility of the true "owner" of the data, the
	submitting author. We cannot commit to running a project that
	will spend a significant proportion of its resources correcting
	data "after the fact".

	Even having said that, the fact that so much of our data no
	longer require manual data entry means that, in the interim,
	the same staff can apply themselves more efficiently to
	validating the data instead of interpreting it.  Our job, then,
	is to provide that staff (and indirectly the community) with an
	increasing battery of checks against the data.

	However, with the advent of automated submission tools that are
	now a reality, we can begin to capitalise on this by providing
	that same level of data integrity checking AT THE LAB BENCH.

	We cannot hope to "clean up" the errors of the past unless we
	can do so fairly automatically; without the power of a
	relational database at our disposal we could not hope to even
	begin to address these needed changes--we would still be
	plugging away entry by entry at the flatfile level.  As Michael
	pointed out we have already targeted what we perceive to be the
	most glaring errors and we are devoting both programming and
	review effort to correcting them. In addition, the curator
	program has already begun to bring in a number of scientists,
	each of whose goal is to "fix" selected regions of the

	Conferring the responsibility for data entry to the community,
	conferring the responsibility for "data release" to the
	community (watch for this one!), conferring the responsibility
	for data correction and updating to the community, the GenBank
	Curator program, author response/review feedback loops, public
	error communication channels, all point to one (albeit cliched)
	philosophy on the part of GenBank:

		In order to succeed in managing the problem of an
		exponential growth of data, we must facilitate the
		community to become part of the solution to, rather
		than the source of, that problem.

	Tom said "get on with it"; I say: "we are, and we're not finished yet".


