converging databases{entrez,blast,fasta}: SOUNDING for critical contr

Leslie Taylor ltaylor at socrates.ucsf.edu
Tue Aug 3 13:34:41 EST 1993


	This note briefly describes the first proposal for a NEAR term solution
to the duplicate database problem.  "Converging on a Sequence Database Format 
Different Packages Could Share" is the software developers' topic for 
3pm Thursday August 19th in Waterville Valley.

	If you are new to this discussion, the motivation for the 
sharing of databases is outlined at the end. 

	So far, JUST ONE CONCRETE PROPOSAL IS ON THE TABLE.  Please send 
ahead brief descriptions of alternatives so people can prepare, or if you 
plan to wait until the 19th, please take this description as an example 
of a technical starting point considered concrete and implementable.

***********
BRIEF DESCRIPTION OF THE WHEELER PROPOSAL:

	Dave Wheeler has proposed that the NCBI distribute a form of 
the "ENTREZ" package (software and data) that can share sequence data with 
the "SEARCHfmt"  data.  The change simply replaces the sequence information in 
the "seq-data" field in (ASN.1 formatted) ENTREZ data with a POINTER 
to the corresponding sequence information in the SEARCHfmt database. 
The SEARCHfmt data was already proposed to contain the "giim" sequence 
identifiers that point to entries in ENTREZ, so no modifications to 
the SEARCH database would be necessary.
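
	To make the idea concrete, here is a minimal sketch in Python of how 
an entry might resolve its sequence through such a pointer.  Everything 
in it is an assumption made for illustration: the class names, the 
resolve_sequence function, the example identifier 101, and the use of a 
dictionary keyed by "giim" as a stand-in for the SEARCHfmt database.  
The real ENTREZ data is ASN.1 and the real pointer layout is not 
specified in this note.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class SeqPointer:
    """Hypothetical pointer stored in place of the seq-data field."""
    giim: int  # sequence identifier assumed to be shared by both databases


@dataclass
class EntrezEntry:
    """Toy model of an ENTREZ record (the real records are ASN.1)."""
    giim: int
    title: str
    seq_data: Union[str, SeqPointer]  # inline sequence OR a pointer


def resolve_sequence(entry: EntrezEntry, search_db: Dict[int, str]) -> str:
    """Return the sequence, following the pointer when one is present."""
    if isinstance(entry.seq_data, SeqPointer):
        # Shared case: the sequence lives only in the SEARCHfmt stand-in.
        return search_db[entry.seq_data.giim]
    # Unshared case: the entry still carries its own copy.
    return entry.seq_data


# Stand-in for the SEARCHfmt database: giim -> sequence (made-up data).
search_db = {101: "ACGTACGT"}

entry = EntrezEntry(giim=101, title="example entry",
                    seq_data=SeqPointer(giim=101))
print(resolve_sequence(entry, search_db))  # prints ACGTACGT
```

	The point of the sketch is only that the sequence is stored once: 
the ENTREZ side keeps its annotation and looks the sequence up by 
identifier, while the search side is left unchanged.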

	Glossary and references associated with Wheeler's proposal:

	SEARCHfmt:  A format used by the blast server and usable by fasta.
		Use anonymous ftp to ncbi.nlm.nih.gov in directory pub/searchfmt
		OR GOPHER to National Center for Biotechnology Information (NCBI)

	ENTREZ: Interface for retrieving sequences and references on net or CDROM
		Use anonymous ftp to ncbi.nlm.nih.gov in directory entrez/docs
		OR GOPHER to National Center for Biotechnology Information (NCBI)

END OF BRIEF DESCRIPTION of the proposal

***********
REMINDER OF WHY WE WANT TO CONVERGE ON A DATABASE TO SHARE BETWEEN PACKAGES:

WHY converge on a database to share between packages?

1.  So MAC and PC users could download (or have client/server network 
access to) fresh data nightly for any of their packages.

2.  So all users may have the advantage of multiple application packages 
accessing the databases without having to maintain separate copies for each 
package.

	Why use multiple application packages currently in existence?
		Because some have superior user interfaces.
		Because none is comprehensive over all manipulations.
		Because some require bigger CPUs.

	Why permit future products to share databases with older products?
		To smooth upgrade paths for user interfaces and applications.
		To permit developers to concentrate on UI and bio-analyses.

	Why decouple User Interface and low level database software development?
		To permit developers to stick to their area of expertise.
		To permit sequence databases to catch up technologically.

Leslie Taylor   Sequence Analysis Service           email:ltaylor at cgl.ucsf.edu 
Computer Graphics Lab UCSF 		            office: (415) 476-5379 
Box 0446 Room S926 			            fax:    (415) 502-1755
513 Parnassus Avenue San Francisco, CA 94143-0446


