Database integration with sequence analysis software - how?

Arne Mueller a.mueller at icrf.icnet.uk
Tue Aug 10 14:45:43 EST 1999


Piotr Kozbial wrote:
> 
> I am interested in testing several ideas about organization of genomic
> information.
> 
> Could you please send me references about:
> 
> 1. Sequence management in relational databases.

Hi,

I think storing sequence data with all the known biological information
in a relational SQL database must be far more efficient than anything
else...
However, smart database design seems to be a tricky thing, and
bioscientists like me like the old text-formatted files; portability is
also easy with these flat text files. Still, I decided to use an SQL
database to manage all my sequence data plus all the additional stuff
connected to the sequences.

> Databases, I know, store data in tables and rows, but sequences seem to
> be stored in flat files (e.g. in FASTA format). Is it a good idea to chop
> the sequences and transfer them into a relational database? Some kinds of
> sequences are well suited for storage in a relational database (e.g.
> protein and cDNA sequences), but genomic sequences are not. Is it a good
> idea to cut genomic sequences into fragments containing ORFs with
> their upstream and downstream sequence, and with some positioning
> information (e.g. IDs of upstream and downstream ORFs)? With each ORF
> in the database it is possible to store additional information (computed
> or taken from known literature) like:

Hm, if the genome identification is complete I'd say yes, split the
genome into ORFs including all the positioning data, regulatory elements
(if known) etc. I don't know much about relational databases, but it
seems to be a (biological) problem to split the data and then connect it
again when searching the database.
Honestly - I've no idea about THE ideal solution, so I'd split the data!
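
As a rough sketch of what "splitting" could look like (not a recommended
design): one row per ORF with its position on the genome, links to the
neighbouring ORFs, and a separate table for the additional information.
All table and column names below are made up for illustration, and SQLite
through Python's sqlite3 module merely stands in for whatever SQL server
is actually used.

```python
import sqlite3

# Hypothetical schema; every table and column name here is illustrative.
SCHEMA = """
CREATE TABLE genome (
    genome_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE orf (
    orf_id        INTEGER PRIMARY KEY,
    genome_id     INTEGER NOT NULL REFERENCES genome(genome_id),
    start_pos     INTEGER NOT NULL,   -- 1-based start on the genome
    end_pos       INTEGER NOT NULL,   -- inclusive end on the genome
    strand        TEXT NOT NULL CHECK (strand IN ('+', '-')),
    sequence      TEXT NOT NULL,      -- ORF plus flanking sequence
    upstream_id   INTEGER REFERENCES orf(orf_id),
    downstream_id INTEGER REFERENCES orf(orf_id)
);
CREATE TABLE annotation (
    orf_id INTEGER NOT NULL REFERENCES orf(orf_id),
    kind   TEXT NOT NULL,             -- e.g. 'motif', 'domain', 'interaction'
    value  TEXT NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database holding two toy ORFs (invented data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO genome VALUES (1, 'demo genome')")
    conn.executemany(
        "INSERT INTO orf VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        [
            (1, 1, 101, 160, "+", "ATGAAATAG", None, None),
            (2, 1, 301, 390, "+", "ATGCCCTGA", 1, None),
        ],
    )
    # Link ORF 1 to its downstream neighbour once both rows exist.
    conn.execute("UPDATE orf SET downstream_id = 2 WHERE orf_id = 1")
    conn.execute("INSERT INTO annotation VALUES (2, 'motif', 'demo-motif-1')")
    conn.commit()
    return conn


def orfs_in_region(conn, genome_id, lo, hi):
    """Return IDs of ORFs overlapping [lo, hi], ordered by start position."""
    rows = conn.execute(
        "SELECT orf_id FROM orf "
        "WHERE genome_id = ? AND start_pos <= ? AND end_pos >= ? "
        "ORDER BY start_pos",
        (genome_id, hi, lo),
    )
    return [row[0] for row in rows]
```

With the positions stored per ORF, "connecting the data again" for a
region becomes a simple range query, e.g. orfs_in_region(conn, 1, 150, 320).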

> -cDNA sequence,
> -IDs of known aa motifs,
> -ID of known conserved structural domains,
> -ID of interacting proteins,
> -pre computed information about structural, sequence, and functional
> homologies (similar to "neighbors" in NCBI databases),
> -all other information (especially raw experimental data),
> 
> 2. There are lots of tools for sequence analysis written in perl, c,
> c++, etc.
> How should the interface between the database and the tools be designed?
> Are there any examples?

I'd use some software in the middle. You may need a tool that performs
the SQL database query and writes the sequences (results) in a common
format (e.g. FASTA) to whatever program you use (Fasta, Blast ...);
otherwise you have to hack the code of the existing programs you use.
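
A minimal sketch of such a piece of "software in the middle", assuming
the query returns two columns (an identifier and a sequence). The table
name "protein", its columns and the toy data are hypothetical; SQLite via
Python's sqlite3 module is used only to keep the example self-contained.

```python
import io
import sqlite3


def write_fasta(rows, handle, width=60):
    """Write (identifier, sequence) pairs to handle in FASTA format."""
    for ident, seq in rows:
        handle.write(f">{ident}\n")
        # Wrap the sequence into lines of at most `width` characters.
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


def export_query(conn, sql, params, handle):
    """Run a query yielding (identifier, sequence) rows; dump them as FASTA."""
    write_fasta(conn.execute(sql, params), handle)


def demo():
    """Build a toy table (invented data) and return the FASTA text."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE protein (name TEXT, sequence TEXT)")
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?)",
        [("p1", "MKV" * 25), ("p2", "MSTNPKPQRK")],
    )
    out = io.StringIO()
    export_query(
        conn,
        "SELECT name, sequence FROM protein WHERE name LIKE ? ORDER BY name",
        ("p%",),
        out,
    )
    return out.getvalue()
```

In practice the handle would be an ordinary file, which is then given to
the search program as its input, so the analysis tools themselves stay
untouched.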

HTTP servers and databases work together via CGI scripts, and there are
lots of intermediate software packages that are intended to make life
easier when communicating between applications and database servers. If
you're interested in MySQL and related software, have a look at
http://sunsite.icm.edu.pl/mysql/ .

Is there any documentation about integrating biological information in
relational databases, any starting points?

	That's lots of text with minimal help - however, maybe we can start a
discussion,

	Arne

-- 
Arne Mueller
Biomolecular Modelling Laboratory
Imperial Cancer Research Fund
44 Lincoln's Inn Fields
London WC2A 3PX, U.K.
phone : +44-(0)171 2693405      | fax :+44-(0)171-269-3534
email : a.mueller at icrf.icnet.uk | http://www.icnet.uk/bmm/
