databases search - protein size

Laurent Duret duret at evoserv.univ-lyon1.fr
Thu May 5 10:53:44 EST 1994


In article 36h at lyra.csx.cam.ac.uk, bjd12 at cus.cam.ac.uk (Ben Davis) writes:
> Hi
> 
> I'm trying to find a way of searching databases for proteins (ideally from
> E.Coli) with a mass in a given range (say between 10 and 18 kDa).
> 
> Anyone got any suggestions ?
> 


Manolo Gouy and co-workers have written a very good program, ACNUC,
that can run this kind of query:

- select complete coding sequences from E. coli whose length is between
  250 and 500 nt (roughly 10 - 18 kDa once translated, assuming about
  110 Da per amino acid), and then translate them into protein.
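
For anyone who wants to redo the mass-to-length arithmetic for another range,
here is a minimal Python sketch. It is not part of ACNUC. The 110 Da average
residue mass is a rule-of-thumb assumption, the function name is made up for
illustration, and whether the stop codon should be counted in the length is
also an assumption you may want to switch off.

```python
# Rule-of-thumb assumption: an average amino acid residue weighs about 110 Da.
AVG_RESIDUE_MASS_DA = 110.0


def mass_range_to_cds_length(min_kda, max_kda, include_stop=True):
    """Convert a protein mass range (kDa) into an approximate CDS length range (nt).

    Hypothetical helper: 3 nt per residue, plus 3 nt for the stop codon
    when include_stop is True.
    """
    stop = 3 if include_stop else 0
    min_aa = min_kda * 1000.0 / AVG_RESIDUE_MASS_DA
    max_aa = max_kda * 1000.0 / AVG_RESIDUE_MASS_DA
    return round(min_aa * 3) + stop, round(max_aa * 3) + stop


print(mass_range_to_cds_length(10, 18))  # prints (276, 494)
```

So the 250-500 nt window used in the example is slightly wider than the
10 - 18 kDa range, which is the safer direction for a first pass.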

example:

(lines preceded by : are those entered by the user)
//////////////////////////////////////////////////////////////////////////////////
:query
             ****     ACNUC Data Base Content      ****                        
                GenBank Release 82 (15 April 1994)                             
180,589,455 bases; 169,896 sequences;  90,795 subsequences; 69,376 references. 
Software by M. Gouy & M. Jacobzone, Laboratoire de biometrie, Universite Lyon I

Command? (H for command list)
:select
Enter your selection criteria, or H(elp) (EX: sp=homo sapiens et k=globin@)
:sp=escherichia coli et t=cds et no k=partial
Sequence list named LIST1      contains  3988 seqs (among which  3889 subseqs)
Command? (H for command list)
:modify
List name, or H(elp)? [default=LIST1]

You can modify this sequence list according to:
1. Confirmation/Suppression of sequences from list
2. Sequence length
3. Sequence insertion date
4. Replace subsequences by seq containing them
5. Add subsequences of seq in list
Enter your choice (1-5):
:2
Give your length threshold: (ex:  L>200  or   L<1000)
:l>250
There are now  3698 sequences in list: LIST1     
Command? (H for command list)
:modify
List name, or H(elp)? [default=LIST1]

You can modify this sequence list according to:
1. Confirmation/Suppression of sequences from list
2. Sequence length
3. Sequence insertion date
4. Replace subsequences by seq containing them
5. Add subsequences of seq in list
Enter your choice (1-5):
:2
Give your length threshold: (ex:  L>200  or   L<1000)
:l<500
There are now   772 sequences in list: LIST1     
Command? (H for command list)
:extract
List name, sequence, or accession #, or H(elp)? [default=LIST1]

Name of file to write extracted sequences? (or GCG)
:my_file
Do you want:
  (1) Simple extraction
  (2) Translate into protein and extract
  (3) Fragments or adjacent sequences
  (4) Regions defined by sequence FEATURES
  (5) Regions adjacent to sequence FEATURES
:2
Translating and extracting CB2PIL              
Translating and extracting EC2MIN.ILVH         
Translating and extracting EC2MIN.PE8          
...

Translating and extracting U01159.TRBJ         
Number of extracted sequences:   772
Command? (H for command list)
:stop
STOP: End of ACNUC retrieval program
////////////////////////// end of the example /////////////////////////////


Thus GenBank (release 82) contains 772 complete E. coli coding sequences
between 250 and 500 nt long, i.e. coding for proteins of roughly 10 - 18 kDa.
The translated sequences are saved in the file named "my_file".
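
Since sequence length is only a proxy for mass, you may want to check the real
mass of each translated protein afterwards. Below is a minimal Python sketch,
not something ACNUC provides. It assumes you have already read the extracted
proteins into a dictionary of name -> one-letter sequence (the layout of
"my_file" is not covered here), it uses standard average residue masses, and
the function names and demo sequences are made up for illustration.

```python
# Average residue masses in Da (amino acid minus one water molecule).
RESIDUE_MASS_DA = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_DA = 18.01528  # one water is added back for the free N- and C-termini


def protein_mass_da(seq):
    """Average molecular mass (Da) of an unmodified protein sequence.

    A trailing '*' (stop) is ignored; ambiguous codes such as X, B or Z
    raise KeyError rather than being guessed.
    """
    seq = seq.upper().rstrip("*")
    return sum(RESIDUE_MASS_DA[aa] for aa in seq) + WATER_DA


def filter_by_mass(proteins, min_kda=10.0, max_kda=18.0):
    """Keep only the entries of a name -> sequence dict within the mass range."""
    return {
        name: seq
        for name, seq in proteins.items()
        if min_kda * 1000.0 <= protein_mass_da(seq) <= max_kda * 1000.0
    }


# Made-up demo sequences, not real E. coli proteins.
demo = {"too_light": "A" * 100, "in_range": "W" * 60}
print(sorted(filter_by_mass(demo)))  # prints ['in_range']
```

The sketch ignores post-translational processing (e.g. signal peptide
cleavage), so the masses are those of the primary translation products only.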

You can also retrieve and save all the GenBank information attached to these
sequences.

Here I only have GenBank, but ACNUC is also available for EMBL and PIR-NBRF.

ACNUC allows many different requests with a relatively simple query language.

ACNUC is distributed by anonymous ftp from the internet address:
biom3.univ-lyon1.fr    or, numerically,    134.214.92.37
The directory there is /pub/acnuc

I include below the README file provided at this FTP site.

I hope this helps,


Laurent Duret

================================================================
Laboratoire de Biometrie, Genetique et Biologie des Populations
Bat 741 - URA CNRS 243 Universite Claude Bernard - Lyon I
43, Bd du 11 Novembre 1918 
69622 Villeurbanne cedex FRANCE

Tel: 	+33 72.44.81.42
Fax:	+33 78.89.27.19
E-mail:	duret at biomserv.univ-lyon1.fr
================================================================



============================= README   =================================

                                ACNUC
             A RETRIEVAL SYSTEM FOR GENBANK, EMBL, AND NBRF/PIR

ACNUC is a retrieval system for the nucleotide sequence databases GenBank
or EMBL and for the protein sequence data base NBRF/PIR.

ACNUC is known to run on Sun (SunOs or Solaris), IBM Risc workstations, 
SGI computers, Dec-alpha systems, and VAX/VMS systems. 
It should be easily installed on most unix platforms. Contact me for help
for other unix systems.


ACNUC allows one to select sequences from these 3 data bases according to
many criteria, to translate protein-coding genes into protein, and to extract
selected sequences into user files. ACNUC is unique in providing direct access
to coding regions (e.g. protein coding regions, tRNA or rRNA coding regions)
of DNA fragments present in GenBank and in EMBL.

ACNUC is distributed by anonymous ftp from the internet address:
biom3.univ-lyon1.fr    or, numerically,    134.214.92.37
The directory there is /pub/acnuc

ACNUC is available in two different formats:
	1) Interfaced with the flat files as distributed by GenBank, EMBL, and
	NBRF/PIR. These flat files can be obtained from the data base
	distribution centers by ftp, by tape, or by cd-rom.

	2) (NOT FOR NBRF/PIR) Interfaced with the GCG package.

If the GCG package is installed on your site, then choose format 2) above
because you will not duplicate the data base on your computer. You install
a new database release for the GCG package yourself. Then you proceed to ACNUC 
installation that will access GCG files in read-only mode.
If the GCG package is not installed on your site, choose format 1) above.
If the database flatfiles are not already on your site, the acnuc installation
procedure provides a procedure to get these files by ftp. Flat files are 
accessed by ACNUC in read-only mode.

ACNUC is made of:
	1) a data base, that can be in one of 2 formats as said above;
	2) a retrieval program, named querydiv;
	3) a set of index files that are distributed by ftp by us.

The retrieval program is written in FORTRAN (with a few routines written in C).

ACNUC is updated at each new GenBank, EMBL, and NBRF/PIR release.

ACNUC installation is described in file install_acnuc.doc.

M. Gouy
Laboratoire de Biometrie
Universite Claude Bernard
69622 VILLEURBANNE, France
+33 72.44.81.42
E-mail:  mgouy at evomol.univ-lyon1.fr
==================================== end of file ================================



