Trial Database: PROM

birney at birney at
Sat Jan 15 11:00:02 EST 1994

	I have developed a (very) small scale database system which tries to
consolidate information on protein motifs: At the moment it is only
concerned with one motif, the RNA Recognition Motif (RRM) AKA 
RNA Binding Domain (RBD) or RNP-CS (RiboNucleoProtein Consensus Sequence).
This is a large super family of proteins common throughout RNA 

	The database runs on an email server, like BLAST etc: you send
a query and it mails back selected data. I'm calling it PROM for 
PROtein Motif.

	The system is now ready for a bit of field testing (crossing my
fingers). If anyone out there works on RNA and wants to poke around, please
do! The information it contains is to the best of my knowledge correct,
but the system *might* be a little cranky.

	Please, if you use it, email me personally if you think anything 
is going wrong, or if there are particular things you would like to see
such a system carry.

Ewan Birney

birney at

here's the help file!

		        *P R O M*

	PROtein Motif database system.

This is the help file for the PROM database system. 

Currently this is in the experimental stage.

This is a very small scale system which aims to provide accurate
information for selected domains, integrating the sequences with
domain positions and references wherever possible. Currently 
only the RNA Recognition Motif (RRM or RBD or RNP) is being run but
other motifs will hopefully be added. 

	To get this message, type "help" in the main body of the message.


	Much of this work on the RNA recognition motif is detailed in
"E. Birney, S. Kumar and A. R. Krainer: Analysis of the RNA recognition motif,
RS and RGG  domains: Conservation in pre-mRNA splicing factors. NAR 21: 


	PROM organises sequences under a 'common' name, which is hopefully
the most usual name used in the literature for the gene. This is given
in UPPER case preceded by three lower case letters denoting species. 
Whenever it is clear that two proteins from different species are
performing the same function they are given the same common name. 
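
A rough sketch of how a client script might split a common name into its
species prefix and gene name. This is not part of PROM; the function name and
the exact pattern are assumptions, based only on the convention described
above and on names such as humSF2 and schPABP that appear in this help file.

```python
import re

# Assumed pattern: three lower-case species letters, then an upper-case
# gene name that may also contain digits (as in "humSF2").
COMMON_NAME = re.compile(r"^([a-z]{3})([A-Z][A-Z0-9]*)$")


def split_common_name(name):
    """Return (species_prefix, gene_name), or None if the name does not fit."""
    match = COMMON_NAME.match(name)
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    for candidate in ("humSF2", "schPABP", "m64603"):
        print(candidate, split_common_name(candidate))
```

Accession numbers and locus names do not fit the pattern, so the function
returns None for them rather than guessing.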
	Each common name has associated with it a list of accession numbers
from genbank and locus names from the swiss protein database. Any of these
can be used to refer to the sequence of interest. To retrieve information
about a sequence use either of the commands

	"report <name>"
or	"refer <name>"

where name can either be the common name (if you already know it, or can
have an educated guess) or the accession number from genbank or the locus
name from swiss prot.
	The report command will give you the common name by
which PROM knows the sequence, a list of accession numbers and
locus names for that sequence and the domains with their amino acid
positions for that sequence.
	The refer command will give you the same list along with a series
of references associated with that sequence.  

	Please note that every sequence can be identified by the accession
numbers of the sequences in genbank: any one of them will do. This is probably
the easiest way to cross reference a sequence from a different system. Most
sequences have more than one (if not many) accession numbers associated with
them, which is often a cause of much confusion.

	Lower case should be used for letters in the accession numbers and
locus names (m49094 NOT M49094 and roa1_human NOT ROA1_HUMAN). 
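
A minimal sketch of a helper that applies the lower case rule when building a
"report" line. The function name and the is_common_name flag are hypothetical;
the flag is there because common names mix cases (humSF2) and should be left
as they are.

```python
def report_line(identifier, is_common_name=False):
    """Build one "report" command line for a PROM mail message.

    Accession numbers and locus names are lower-cased, as the help file
    asks; common names are passed through unchanged.
    """
    name = identifier.strip()
    if not is_common_name:
        name = name.lower()
    return "report " + name


if __name__ == "__main__":
    print(report_line("M49094"))
    print(report_line("ROA1_HUMAN"))
    print(report_line("humSF2", is_common_name=True))
```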


	PROM uses the BLAST program written by Gish and Altschul
(Altschul et al., (1990) JMB, 215, 403-410) which has been ported to run
under VMS by Peter Stockwell. I picked this up at the ftp site at
EMBL-heidelberg in the pub/software/vms dir.

	A database highly enriched in RRM sequences is maintained and can
be searched with BLAST. To search with a sequence email a message 
with "search" on the first line, followed by "sequence" on the
second line followed by the sequence in either IG, FASTA or GCG format. FASTA
(Pearson) format is that used by BLAST.  
	In the future qualifiers between "search" and "sequence" will be
allowed (hopefully).
	The sequence file has to be the final part of the mail message: 
everything past "sequence" is taken to be part of the sequence file.
	The BLAST output is emailed back.
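
As an illustration only, the message layout described above could be put
together like this. The function name is hypothetical, and the FASTA record in
the usage example is a made-up placeholder, not a real sequence.

```python
def build_search_message(sequence_text):
    """Return a mail body: "search", then "sequence", then the sequence file.

    Nothing is appended after the sequence, because everything past
    "sequence" is taken to be part of the sequence file.
    """
    body = sequence_text.strip("\n")
    return "search\nsequence\n" + body + "\n"


if __name__ == "__main__":
    # Placeholder FASTA record, for illustration only.
    fasta = ">my_protein hypothetical example\nMSGGKLFVGNLPEDVTE\n"
    print(build_search_message(fasta), end="")
```

If your mailer adds a signature automatically, turn it off for these
messages, since it would end up inside the sequence file.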


	Two different stand alone files are provided. A table of all the 
RRM containing sequences (precisely the same as the table provided through
the file server), and a GCG profile (.prf) file of the RRM, which is used for
identifying RRM containing sequences for the database. To retrieve them
mail the commands
		send table 
	or	send profile

	The table file has the sequences somewhat grouped by phylogeny 
and function, though it is becoming less useful as the number of 
sequences grows.
	The profile file can probably be used as is, mail header included,
in GCG programs. (GCG programs allow a free format header in
their files, which is terminated by a .. ). If not, strip out the 
mail header and it should be fine.


the message :

	report humSF2
	report gp:m64603
would produce the following output:

	The sequence humSF2
	accession number gp:m69040
	accession number gp:m72709
	Domain 1, type RRM, starts 17, ends 96
	Domain 2, type RRM, starts 122, ends 197

	The sequence schPABP
	accession number gp:m64603
	Domain 1, type RRM, starts 67, ends 151
	Domain 2, type RRM, starts 154, ends 238
	Domain 3, type RRM, starts 248, ends 331
	Domain 4, type RRM, starts 354, ends 434
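
For anyone who wants to process the replies automatically, here is a rough
sketch of a parser for report output. It assumes the reply looks exactly like
the example above; the function name and the record layout are my own choices
and are not defined by PROM.

```python
import re

# Matches lines such as "Domain 1, type RRM, starts 17, ends 96".
DOMAIN_LINE = re.compile(r"Domain (\d+), type (\w+), starts (\d+), ends (\d+)")
SEQUENCE_PREFIX = "The sequence "
ACCESSION_PREFIX = "accession number "


def parse_report(text):
    """Parse report output into a list of dicts, one per sequence."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(SEQUENCE_PREFIX):
            current = {
                "name": line[len(SEQUENCE_PREFIX):],
                "accessions": [],
                "domains": [],
            }
            records.append(current)
        elif current is None:
            continue  # skip anything before the first sequence (mail headers)
        elif line.startswith(ACCESSION_PREFIX):
            current["accessions"].append(line[len(ACCESSION_PREFIX):])
        else:
            match = DOMAIN_LINE.fullmatch(line)
            if match:
                number, kind, start, end = match.groups()
                current["domains"].append(
                    (int(number), kind, int(start), int(end))
                )
    return records


if __name__ == "__main__":
    sample = """The sequence humSF2
accession number gp:m69040
accession number gp:m72709
Domain 1, type RRM, starts 17, ends 96
Domain 2, type RRM, starts 122, ends 197
"""
    for record in parse_report(sample):
        for number, kind, start, end in record["domains"]:
            # Positions are taken as inclusive amino acid coordinates.
            print(record["name"], number, kind, end - start + 1)
```

Treating the start and end positions as inclusive is an assumption; under it,
the first RRM of humSF2 (17 to 96) is 80 residues long.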

the message 

	refer x62447

would produce the following output:

The sequence humSC35
accession number gp:x62447
accession number gp:x62446
accession number gp:x62447
accession number gp:m90104
Domain 1, type RRM, starts 15, ends 98
Reference, Cloning
Vellard,M., Sureau,A., Soret,J., Martinerie,C. and Perbal,B. (1992) A
potential splicing factor is encoded by the opposite strand of the
trans spliced c myb exon. Proc. Natl. Acad. Sci. USA 89, 2511 2515.

Reference, Cloning
Fu,X. D. and Maniatis,T. (1992) Isolation of a complementary DNA that
encodes the mammalian splicing factor SC35. Science 256, 535 538.

Reference, General
Sureau A; Soret J; Vellard M; Crochet J; Perbal B The PR264/c myb
connection: expression of a splicing factor modulated by a nuclear
protooncogene.  Proc Natl Acad Sci U S A. 1992 Dec 15; 89(24): 11683

Reference, General
Fu XD; Mayeda A; Maniatis T; Krainer AR General splicing factors SF2
and SC35 have equivalent activities in vitro, and both affect
alternative 5' and 3' splice site selection.  Proc Natl Acad Sci U S
A. 1992 Dec 1; 89(23): 11224 8

Reference, General
Mayeda,A., Zahler,A.M., Krainer,A.R. and Roth,M.B. (1992) Two members
of a conserved family of nuclear phosphoproteins are involved in pre
mRNA splicing. Proc. Natl. Acad. Sci. USA 89, 1301 1304.

the message:

; LOCUS       HUMSRP75A_1
; DEFINITION  Human pre-mRNA splicing factor SRp75 mRNA, complete cds. SR
;             protein family member; SR domain: (bp. 583. .1529); RNA binding
;             domains: RNP-2 (bp. 57. .80) and RNP-1 (bp. 150. .173).
; DATE        06-JUL-1993
; ACCESSION   L14076
; ORGANISM    Homo sapiens Eukaryota; Animalia; Chordata; Vertebrata; Mammalia;
; ORIGIN      Translated using phase 1

would search this sequence against the RRM enriched database and mail you
back the BLAST output.


	The real aim of this database is to fully understand certain
sequences, in this case those with a particular motif. This motif is quite
hard to assign by either eye or the simple PROSITE signature - in many cases
sequences are missed and wrong sequences are assigned. Furthermore this
database tries to collapse series of multiple entries in the large databases
when they are actually the same (or fragments of the same) sequence. It also
provides you with a database to run searches against which is very useful:
One can ask the question "Does any sequence contain both an RRM and a
xxx motif". 

Database searches

	Not every sequence in the enriched database is guaranteed by any means
to contain an RRM. However RRM containing sequences make up around 50% of the
database, compared to below 0.1% in the genpept database. The cut off is
chosen somewhat liberally to try to ensure that every sequence of interest is
included at the expense of allowing some spurious sequences through. If you
have a sequence of interest, to see whether I think it has an RRM simply 
mail PROM with "report <accession_number>". 
	The database contains the full sequences, not simply the RRMs.


	The references provided are from my own personal database which I
have built: I make no attempt here at providing all references. However if
people would be willing to add a list of references that they have that would
be a huge benefit to others. Email me to work out the details.

Missing/misplaced sequences

	There will be some. Currently the chloroplast binding proteins are
giving me problems, but no doubt there are others that have slipped through.
If you have a sequence that you think should be included or you don't
feel is in the right section/phylogeny on the table then please email me
and I'll try to sort it out. 


	Please tell me what YOU think would be useful to provide and ways
to improve the system. 

Have fun!

My address: birney at

	Ewan Birney
	Balliol College
