A problem in Database searching

frist at ccu.umanitoba.ca frist at ccu.umanitoba.ca
Fri Jun 26 11:44:29 EST 1992


In article <01GLMO0VRQX800036Y at CINVESMX.BITNET> FVEGA at CINVESMX.BITNET writes:
... mailer routing messages deleted...
>
>        Dear Netters,
>
>        I have a problem in database searching that I hope someone out
>there can help me with. I am interested in locating genes of E. coli that
>end with the UGA stop codon and partially overlap the AUG start
>codon of the following gene. That is, I am looking for an AUGA pattern,
>but only those cases that are genuine gene overlaps.
>
>        I used PatternSearch from the GCG Package, but as most can imagine,
>the background of AUGA sub-sequences that do not correspond to real
>gene overlaps is enormously high. I don't dare inspect this output
>by hand (looking at each GenBank entry) to locate the real overlaps.
>
>        Does anyone know of software that could do this for me?
>
>        Many thanks,
>
>
>
>        Francisco M. De La Vega
>        Department of Genetics and Molecular Biology
>        CINVESTAV-IPN, Mexico City, Mexico.
>        E-Mail: FVEGA at CINVESMX.Bitnet

The XYLEM package can get you part of the way. Just as an experiment, I
tried the following steps:

1) Create a list of E. coli sequences.
Since XYLEM creates index files with GenBank LOCUS names in the order
they appear in the files, all names for a given species are grouped
together in the index. By pulling out the block of index lines for E. coli,
we get a list of all E. coli sequences. (I did this with a single
command using the vi editor.) This file is called ECO.nam. The first
10 lines are shown below:

ECO16S23S    X12420    74316    48682
ECO1721DNA   X61367    74356    48691
ECO21SUL1    X15371    74461    48841
ECO2MIN      X55034    74515    48865
ECO3926PA    X14236    75158    49244
ECO42RNA     X01895    75216    49286
ECO5388      V00252    75238    49289
ECO571MR     M74821    75267    49305
ECO5CPDB     X54008    75341    49384
ECO5ERNAA    M16640    75382    49393
etc......
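
For anyone who would rather not do step 1 in vi, here is a minimal Python
sketch of the same extraction. It assumes the index has the LOCUS name in
the first whitespace-separated column, as in the listing above, and that
every E. coli name begins with ECO, which is true of the lines shown but
is only an assumption for the rest of the index. The function name
select_species is made up for this illustration and is not part of XYLEM.

```python
def select_species(index_lines, prefix="ECO"):
    """Return the index lines whose LOCUS name starts with `prefix`.

    Assumption: the LOCUS name is the first whitespace-separated field
    on each index line, as in the ECO.nam excerpt.
    """
    selected = []
    for line in index_lines:
        fields = line.split()
        # Skip blank lines; keep lines whose first field has the prefix.
        if fields and fields[0].startswith(prefix):
            selected.append(line.rstrip("\n"))
    return selected


# Small in-memory demonstration. The two non-ECO lines are invented.
demo = [
    "BSUEXAMPLE   X00000    100    10\n",
    "ECO16S23S    X12420    74316    48682\n",
    "ECO1721DNA   X61367    74356    48691\n",
    "HUMEXAMPLE   X00001    200    20\n",
]
for entry in select_species(demo):
    print(entry)
```

To use it on a real index, pass an open file handle instead of the demo
list and write the result to ECO.nam.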

2) This list can now be used as input for the FEATURES program, which
will extract all protein coding sequences (CDS) as shown in the user
menu below:

___________________________________________________________________
                     FEATURES - Version   15 May 92                  
___________________________________________________________________
Features:  CDS
Entries:   ECO.nam
Database:  /home/psgendb/GenBank/gbbct
___________________________________________________________________
   Parameter              Description                      Value
-------------------------------------------------------------------
1).................... FEATURES TO EXTRACT ....................> f
  f:Type a feature at the keyboard 
  F:Read a list of features from a file
2)....................ENTRIES TO BE PROCESSED (choose one).....> N
  Keyboard input - n:name     a:accession #     e:expression
  File input     - N:name(s)  A:accession #(s)  E:expression(s)
3)....................WHERE TO GET IT .........................> u
  u:User-defined database subset   g:complete GenBank database
4)....................WHERE TO SEND IT ........................> a
  s:Each feature to a separate file  a:All output to same file
   ---------------------------------------------------------------
   Type number of your choice or 0 to continue:
0
Messages will be written to ECO.msg
Final sequence output will be written to  ECO.out
Expressions will be written to ECO.exp
Extracting features...
 
and there are now four files in our directory:

-rw-------  1 psgendb     95507 Jun 26 10:17 ECO.exp
-rw-------  1 psgendb   1248705 Jun 26 10:19 ECO.msg
-rw-------  1 psgendb     72770 Jun 26 10:03 ECO.nam
-rw-------  1 psgendb   2374474 Jun 26 10:19 ECO.out
 
Since ECO.out contains the DNA sequences for each CDS, it is quite
straightforward to look for all sequences beginning with atga. You
could write a fairly simple program that searches the .out file and
writes a new name file with the names of those sequences beginning with
atga. You could almost use grep to do this, since

egrep -n '^atga' ECO.out > atga.out

writes a file containing numbered output of all lines starting with atga:

109:atgacaaagttgcagccgaatacagtgatccgtgccgccctggacctgtt
179:atgagccagcaagtcattattttcgataccacattgcgcgacggtgaaca
185:atgactcattccacggcaatggattctgtttttatcagaacccgtatctt
210:atgatgcattgcataccgtgggtggtattgatcatgtattagttcgtcat
225:atgaccgaacgacgaacaatctggcaaagtactgcccaaatgccactgtt
291:atgatggaaaactataaacatactacggtgctgctggatgaagccgttaa
311:atgatcagcagagtgacagaagctctaagcaaagttaaaggatcgatggg
320:atgaaagcagcggcgaaaacgcagaaaccaaaacgtcaggaagaacatgc
388:atgattagcgtaacccttagccaacttaccgacattctcaacggtgaact
516:atgaatacacaacaattggcaaaactgcgttccatcgtgcccgaaatgcg
etc...

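There are 1319 such lines in ECO.out.  While most of these are probably the
beginning of a CDS, egrep matches any line that starts with atga, so a
continuation line inside a CDS that happens to begin with atga is reported
as well. It is best to have a program eliminate these false positives for you.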
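
A rough Python sketch of such a program follows. It assumes ECO.out is
FASTA-like, that is, each CDS has a header line starting with '>' followed
by one or more lines of sequence. I have not shown the actual layout of
ECO.out here, so check it first and adjust the header test if it differs.
The function name and the record names in the demonstration are made up.
Only the first bases after each header are tested, so wrapped continuation
lines can never produce a hit.

```python
def names_starting_with(lines, motif="atga"):
    """Return the names of records whose sequence begins with `motif`.

    Assumption: FASTA-like input, where a line starting with '>' is a
    header and the lines after it hold the sequence.
    """
    motif = motif.lower()
    hits = []
    name = None   # header text of the record currently being read
    prefix = ""   # first len(motif) bases seen for that record
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            # A new record begins: decide on the previous one first.
            if name is not None and prefix == motif:
                hits.append(name)
            name = line[1:].strip()
            prefix = ""
        elif name is not None and len(prefix) < len(motif):
            # Collect bases only until we have enough to compare.
            prefix = (prefix + line.lower())[:len(motif)]
    # Do not forget the last record in the file.
    if name is not None and prefix == motif:
        hits.append(name)
    return hits


# In-memory demonstration with invented records. REC2 has a continuation
# line starting with atga, which egrep would report but this does not.
demo = [
    ">REC1\n", "atgacaaagttgcag\n", "ccgaatacag\n",
    ">REC2\n", "atgcccgtgaaa\n", "atgattagcg\n",
]
print(names_starting_with(demo))
```

The returned names could be written one per line to a new name file and
used as input for further processing.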



