Searching GenBank for nucleotide repeats

James Tisdall tisdall at amalthea.humgen.upenn.edu
Thu Mar 3 08:46:45 EST 1994


In article <keith-010394135254 at mac08.biochem.ualberta.ca> keith at bones.biochem.ualberta.ca (Keith Robinson) writes:
>A staff member here wants to search all of the rodent cDNA sequences in
>GenBank to get an idea of the frequency of occurrence of:
>
> - Single base repeats (e.g. GGGGGG) of lengths 6 to 20 inclusive 
>    for all 4 possible combinations
>
> - double-base repeats (e.g. GAGAGAGAGAGA) of length 6 to 20, for
>    all 16 possible combinations
>
> - triple-base repeats (e.g. GATGATGATGATGATGAT) of length 6 to 20,
>    for all 64 possible combinations
>
>We use GCG here, and it is possible to perform this search with GCG's
>"findpatterns" command (e.g. searching for G repeats can be done with
>the pattern ~GG{6,20}~G), but this is time consuming (human and computer),
>and processing the resulting output file is rather tedious. Before 
>attempting to write our own program, does anyone know of any software 
>which would make setting up and interpreting results of these searches 
>easier?
>
> Keith Robinson             Dept. of Biochemistry
> The University of Alberta  Edmonton, Alberta Canada

This was relatively straightforward using "DNA WorkBench", available by
ftp from cbil.humgen.upenn.edu:/pub/dnaworkbench
I used the Unix version.

I've appended the output of the program, followed by the program.  I put
the program in a file called "repeats" and executed the command
dnaworkbench -q -f repeats

The program includes start and end timing information; judging by the
two timestamps at the top of the output, it took about an hour and 40
minutes to run.

If the last line of the program is uncommented, by removing the
leading "#" character, the program runs and leaves you in interactive
mode, with all the search results available for further processing.

(By "length 6 to 20" I assume you mean that number of copies of the
repeating unit, not the length in nucleotides; e.g. the trinucleotide ACG
repeated 6 to 20 times would not include "acgacgacg", which is only 3
copies, or 9 nucleotides.  If I am assuming incorrectly, and you want
each matched string to be between 6 and 20 nucleotides long, just edit
the program so that the dinucleotides are in the form e.g. (AT){3,10} and
the trinucleotides are in the form e.g. (ATG){2,6}, since 7 copies of a
trinucleotide would already be 21 nucleotides.)
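
To make the two readings concrete, here is a minimal sketch in Python
(using Python's "re" module, not DNA WorkBench).  It assumes the {m,n}
quantifier means what it does in ordinary regular expressions; the helper
name count_bounds is made up for this illustration.

```python
import re

def count_bounds(unit_len, min_nt=6, max_nt=20):
    """Copy-count bounds so the whole match spans min_nt..max_nt nucleotides."""
    lo = -(-min_nt // unit_len)   # ceiling division
    hi = max_nt // unit_len       # floor division
    return lo, hi

# Reading 1: 6 to 20 copies of the repeating unit.
by_copies = re.compile(r"(ACG){6,20}", re.IGNORECASE)

# Reading 2: a matched string of 6 to 20 nucleotides in total.
lo, hi = count_bounds(3)
by_length = re.compile(r"(ACG){%d,%d}" % (lo, hi), re.IGNORECASE)

short = "acgacgacg"                     # 3 copies, 9 nucleotides
print(bool(by_copies.search(short)))    # False
print(bool(by_length.search(short)))    # True

for k in (1, 2, 3):
    print(k, count_bounds(k))           # (6, 20), (3, 10), (2, 6)
```

Note that either pattern will also match inside a longer run, so a stretch
of 30 copies still counts as a hit.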

I extract the GenBank cDNA sequences by simply searching for the
string cDNA (or c-DNA) over the gbrod.seq library; then, for each
set of hits on a given repeat sequence, I take the intersection of
the hits with the cDNA set.  My simple method of extracting the cDNA
sequences no doubt includes some false positives, and may be missing
some cDNAs as well, but to "get an idea of the frequency of occurrence"
of these repeats in rodent cDNA, I expect it's fairly accurate.
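
The same search-then-intersect idea can be sketched in Python with sets.
This is only an illustration, not how DNA WorkBench works internally: the
three records and their accession numbers are invented, and I am assuming
the text search ignores case.

```python
import re

# Hypothetical toy records: accession -> (annotation text, sequence).
records = {
    "X00001": ("Mouse cDNA for some protein", "TTGAGAGAGAGAGAGACC"),
    "X00002": ("Rat genomic DNA, exon 1", "CCGAGAGAGAGAGAGATT"),
    "X00003": ("Rat c-DNA clone", "ACGTACGTACGTACGTAC"),
}

# Same pattern as "text c-?dna": matches cDNA and c-DNA.
CDNA = re.compile(r"c-?dna", re.IGNORECASE)

def text_hits(records, pattern):
    """Accessions whose annotation text matches the pattern."""
    return {acc for acc, (text, _) in records.items() if pattern.search(text)}

def sequence_hits(records, pattern):
    """Accessions whose sequence contains a match for the pattern."""
    return {acc for acc, (_, seq) in records.items() if pattern.search(seq)}

cdna_set = text_hits(records, CDNA)
ga_hits = sequence_hits(records, re.compile(r"(GA){6,20}"))
ga_in_cdna = ga_hits & cdna_set         # the "intersection 1 $" step

print(len(cdna_set), len(ga_hits), len(ga_in_cdna))   # 2 2 1
```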

It is often unnecessary to search for all of e.g. the 64 trinucleotide
repeats.  For instance, AAA is already a repeat of A, so its repeats will
be found by searching for repeats of A.  Also, ACT, CTA and TAC are
"rotations" of each other, so you really only have to look for one of
them.  Following are the "interesting" repeats up to length 3 (send me
mail for a program that calculates these, up to length 8).  Note that
there are only 30 of these (4 + 6 + 20), as opposed to 4+16+64 = 84 for
all possible strings.  Of course, it depends on how exact you're trying
to be: you may find some sequences in which there are e.g. 6 AAG repeats
but not 6 GAA repeats.  Also, it is sometimes necessary to search for
only one of a pair of complementary fragments such as ACT and TGA.

### Table of non-periodic strings over {a,c,g,t} modulo rotations, up to
### length 3.
A
C
G
T
AC
AG
AT
CG
CT
GT
AAC
AAG
AAT
ACC
ACG
ACT
AGC
AGG
AGT
ATC
ATG
ATT
CCG
CCT
CGG
CGT
CTG
CTT
GGT
GTT
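
The table above can be reproduced with a short Python sketch.  This is
not the program offered by mail above, just one straightforward way to do
it: drop every string that equals one of its own proper rotations (those
are the periodic ones), and keep the alphabetically smallest member of
each rotation class.  The function names are my own.

```python
import re
from itertools import product

def rotations(s):
    """All rotations of s, including s itself."""
    return [s[i:] + s[:i] for i in range(len(s))]

def primitive_representatives(alphabet="ACGT", max_len=3):
    """Non-periodic strings modulo rotation, one representative per class."""
    reps = []
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            s = "".join(letters)
            rots = rotations(s)
            # A periodic string (AA, CGCG, ...) has fewer than n distinct rotations.
            if len(set(rots)) < n:
                continue
            # Keep only the alphabetically smallest rotation of each class.
            if s == min(rots):
                reps.append(s)
    return reps

reps = primitive_representatives()
print(len(reps))          # 30
print(" ".join(reps))

# Rotations are not strictly interchangeable at a fixed copy count:
run = "AAG" * 6
print(bool(re.search(r"(AAG){6}", run)), bool(re.search(r"(GAA){6}", run)))
# True False
```

Calling primitive_representatives(max_len=8) extends the list to length 8,
at the cost of enumerating all strings up to that length.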

Jim
======================================================================
James Tisdall

Departments of Genetics and Computer and Information Science
Computational Biology and Informatics Laboratory, Human Genome Project
University of Pennsylvania
tisdall at cbil.humgen.upenn.edu 215-573-3113 fax 215-573-3111

Biocomputing Associates
(610) 933-9266
======================================================================


#####Cut here: start of program output########
Wed Mar  2 19:11:55 EST 1994
Wed Mar  2 20:53:37 EST 1994
   "HELP SEARCHES" for instructions to LOAD, SAVE, WRITE, and RETRIEVE
    To combine searches use "HELP" for INTERSECTION UNION DIFFERENCE
           SEARCHES
   Size  Number  Command
_____________________________________
   8158       1  text c-?dna (in gbrod)
   5268       2  sequence A{6,20} (in gbrod)
   2368       3  intersect  1 $
   2818       4  sequence C{6,20} (in gbrod)
   1257       5  intersect  1 $
   3027       6  sequence G{6,20} (in gbrod)
   1232       7  intersect  1 $
   4365       8  sequence T{6,20} (in gbrod)
   1789       9  intersect  1 $
   1162      10  sequence (AA){6,20} (in gbrod)
    562      11  intersect  1 $
    519      12  sequence (AC){6,20} (in gbrod)
    114      13  intersect  1 $
    248      14  sequence (AG){6,20} (in gbrod)
     66      15  intersect  1 $
    191      16  sequence (AT){6,20} (in gbrod)
     61      17  intersect  1 $
    525      18  sequence (CA){6,20} (in gbrod)
    117      19  intersect  1 $
    141      20  sequence (CC){6,20} (in gbrod)
     37      21  intersect  1 $
     17      22  sequence (CG){6,20} (in gbrod)
      2      23  intersect  1 $
    244      24  sequence (CT){6,20} (in gbrod)
     65      25  intersect  1 $
    247      26  sequence (GA){6,20} (in gbrod)
     57      27  intersect  1 $
     18      28  sequence (GC){6,20} (in gbrod)
      3      29  intersect  1 $
    121      30  sequence (GG){6,20} (in gbrod)
     30      31  intersect  1 $
    535      32  sequence (GT){6,20} (in gbrod)
    106      33  intersect  1 $
    196      34  sequence (TA){6,20} (in gbrod)
     62      35  intersect  1 $
    248      36  sequence (TC){6,20} (in gbrod)
     65      37  intersect  1 $
    546      38  sequence (TG){6,20} (in gbrod)
    117      39  intersect  1 $
    663      40  sequence (TT){6,20} (in gbrod)
    239      41  intersect  1 $
    483      42  sequence (AAA){6,20} (in gbrod)
    240      43  intersect  1 $
     14      44  sequence (AAC){6,20} (in gbrod)
      4      45  intersect  1 $
     24      46  sequence (AAG){6,20} (in gbrod)
     16      47  intersect  1 $
     15      48  sequence (AAT){6,20} (in gbrod)
      4      49  intersect  1 $
     14      50  sequence (ACA){6,20} (in gbrod)
      4      51  intersect  1 $
     32      52  sequence (ACC){6,20} (in gbrod)
     15      53  intersect  1 $
      1      54  sequence (ACG){6,20} (in gbrod)
      0      55  intersect  1 $
      2      56  sequence (ACT){6,20} (in gbrod)
      1      57  intersect  1 $
     24      58  sequence (AGA){6,20} (in gbrod)
     16      59  intersect  1 $
     57      60  sequence (AGC){6,20} (in gbrod)
     33      61  intersect  1 $
     49      62  sequence (AGG){6,20} (in gbrod)
     18      63  intersect  1 $
      2      64  sequence (AGT){6,20} (in gbrod)
      1      65  intersect  1 $
     15      66  sequence (ATA){6,20} (in gbrod)
      4      67  intersect  1 $
      8      68  sequence (ATC){6,20} (in gbrod)
      0      69  intersect  1 $
      6      70  sequence (ATG){6,20} (in gbrod)
      2      71  intersect  1 $
     19      72  sequence (ATT){6,20} (in gbrod)
      4      73  intersect  1 $
     14      74  sequence (CAA){6,20} (in gbrod)
      4      75  intersect  1 $
     33      76  sequence (CAC){6,20} (in gbrod)
     15      77  intersect  1 $
     55      78  sequence (CAG){6,20} (in gbrod)
     30      79  intersect  1 $
      9      80  sequence (CAT){6,20} (in gbrod)
      1      81  intersect  1 $
     28      82  sequence (CCA){6,20} (in gbrod)
     12      83  intersect  1 $
     15      84  sequence (CCC){6,20} (in gbrod)
      3      85  intersect  1 $
     13      86  sequence (CCG){6,20} (in gbrod)
      8      87  intersect  1 $
     26      88  sequence (CCT){6,20} (in gbrod)
      5      89  intersect  1 $
      0      90   (CGA){6,20} (in gbrod)
      0      91  intersect  1 $
     18      92  sequence (CGC){6,20} (in gbrod)
      9      93  intersect  1 $
     14      94  sequence (CGG){6,20} (in gbrod)
      7      95  intersect  1 $
      0      96   (CGT){6,20} (in gbrod)
      0      97  intersect  1 $
      2      98  sequence (CTA){6,20} (in gbrod)
      1      99  intersect  1 $
     24     100  sequence (CTC){6,20} (in gbrod)
      4     101  intersect  1 $
     44     102  sequence (CTG){6,20} (in gbrod)
     15     103  intersect  1 $
     15     104  sequence (CTT){6,20} (in gbrod)
      4     105  intersect  1 $
     24     106  sequence (GAA){6,20} (in gbrod)
     16     107  intersect  1 $
      2     108  sequence (GAC){6,20} (in gbrod)
      0     109  intersect  1 $
     50     110  sequence (GAG){6,20} (in gbrod)
     20     111  intersect  1 $
     10     112  sequence (GAT){6,20} (in gbrod)
      6     113  intersect  1 $
     56     114  sequence (GCA){6,20} (in gbrod)
     30     115  intersect  1 $
     19     116  sequence (GCC){6,20} (in gbrod)
     10     117  intersect  1 $
     15     118  sequence (GCG){6,20} (in gbrod)
      7     119  intersect  1 $
     38     120  sequence (GCT){6,20} (in gbrod)
     13     121  intersect  1 $
     51     122  sequence (GGA){6,20} (in gbrod)
     18     123  intersect  1 $
     20     124  sequence (GGC){6,20} (in gbrod)
     10     125  intersect  1 $
     13     126  sequence (GGG){6,20} (in gbrod)
      4     127  intersect  1 $
     12     128  sequence (GGT){6,20} (in gbrod)
      1     129  intersect  1 $
      3     130  sequence (GTA){6,20} (in gbrod)
      1     131  intersect  1 $
      0     132   (GTC){6,20} (in gbrod)
      0     133  intersect  1 $
     11     134  sequence (GTG){6,20} (in gbrod)
      1     135  intersect  1 $
     25     136  sequence (GTT){6,20} (in gbrod)
     10     137  intersect  1 $
     15     138  sequence (TAA){6,20} (in gbrod)
      4     139  intersect  1 $
      2     140  sequence (TAC){6,20} (in gbrod)
      1     141  intersect  1 $
      2     142  sequence (TAG){6,20} (in gbrod)
      1     143  intersect  1 $
     19     144  sequence (TAT){6,20} (in gbrod)
      4     145  intersect  1 $
      9     146  sequence (TCA){6,20} (in gbrod)
      1     147  intersect  1 $
     26     148  sequence (TCC){6,20} (in gbrod)
      5     149  intersect  1 $
      1     150  sequence (TCG){6,20} (in gbrod)
      0     151  intersect  1 $
     15     152  sequence (TCT){6,20} (in gbrod)
      4     153  intersect  1 $
      9     154  sequence (TGA){6,20} (in gbrod)
      5     155  intersect  1 $
     38     156  sequence (TGC){6,20} (in gbrod)
     12     157  intersect  1 $
     12     158  sequence (TGG){6,20} (in gbrod)
      2     159  intersect  1 $
     25     160  sequence (TGT){6,20} (in gbrod)
     10     161  intersect  1 $
     19     162  sequence (TTA){6,20} (in gbrod)
      4     163  intersect  1 $
     15     164  sequence (TTC){6,20} (in gbrod)
      4     165  intersect  1 $
     26     166  sequence (TTG){6,20} (in gbrod)
      9     167  intersect  1 $
    189     168  sequence (TTT){6,20} (in gbrod)
     49     169  intersect  1 $
#####Cut here: end of program output########
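
If you save a listing like the one above to a file, a few lines of Python
will pair each search with the intersection that follows it.  This is a
sketch that assumes the layout shown (size, search number, command), with
every sequence search followed directly by its "intersect 1 $" line; the
sample rows are copied from the listing and the names are my own.

```python
import re

LISTING = """\
   8158       1  text c-?dna (in gbrod)
   5268       2  sequence A{6,20} (in gbrod)
   2368       3  intersect  1 $
      0      90   (CGA){6,20} (in gbrod)
      0      91  intersect  1 $
"""

ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(.*)$")
PATTERN = re.compile(r"(\(?[ACGT]+\)?\{\d+,\d+\})")

def tabulate(listing):
    """Return (pattern, hits in gbrod, hits also in the cDNA set) tuples."""
    rows = []
    pending = None
    for line in listing.splitlines():
        m = ROW.match(line)
        if not m:
            continue                      # headers, timestamps, rules
        size, command = int(m.group(1)), m.group(3)
        if command.startswith("intersect"):
            if pending is not None:
                rows.append((pending[0], pending[1], size))
                pending = None
        else:
            # Some zero-hit rows lack the word "sequence", so key on the pattern.
            p = PATTERN.search(command)
            pending = (p.group(1), size) if p else None
    return rows

for pattern, total, in_cdna in tabulate(LISTING):
    print(pattern, total, in_cdna)
```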
#####Cut here: start of DNA WorkBench program ########
date
text c-?dna gbrod
sequence A{6,20}  gbrod ; intersection 1 $
sequence C{6,20}  gbrod ; intersection 1 $
sequence G{6,20}  gbrod ; intersection 1 $
sequence T{6,20}  gbrod ; intersection 1 $
sequence (AA){6,20}  gbrod ; intersection 1 $
sequence (AC){6,20}  gbrod ; intersection 1 $
sequence (AG){6,20}  gbrod ; intersection 1 $
sequence (AT){6,20}  gbrod ; intersection 1 $
sequence (CA){6,20}  gbrod ; intersection 1 $
sequence (CC){6,20}  gbrod ; intersection 1 $
sequence (CG){6,20}  gbrod ; intersection 1 $
sequence (CT){6,20}  gbrod ; intersection 1 $
sequence (GA){6,20}  gbrod ; intersection 1 $
sequence (GC){6,20}  gbrod ; intersection 1 $
sequence (GG){6,20}  gbrod ; intersection 1 $
sequence (GT){6,20}  gbrod ; intersection 1 $
sequence (TA){6,20}  gbrod ; intersection 1 $
sequence (TC){6,20}  gbrod ; intersection 1 $
sequence (TG){6,20}  gbrod ; intersection 1 $
sequence (TT){6,20}  gbrod ; intersection 1 $
sequence (AAA){6,20}  gbrod ; intersection 1 $
sequence (AAC){6,20}  gbrod ; intersection 1 $
sequence (AAG){6,20}  gbrod ; intersection 1 $
sequence (AAT){6,20}  gbrod ; intersection 1 $
sequence (ACA){6,20}  gbrod ; intersection 1 $
sequence (ACC){6,20}  gbrod ; intersection 1 $
sequence (ACG){6,20}  gbrod ; intersection 1 $
sequence (ACT){6,20}  gbrod ; intersection 1 $
sequence (AGA){6,20}  gbrod ; intersection 1 $
sequence (AGC){6,20}  gbrod ; intersection 1 $
sequence (AGG){6,20}  gbrod ; intersection 1 $
sequence (AGT){6,20}  gbrod ; intersection 1 $
sequence (ATA){6,20}  gbrod ; intersection 1 $
sequence (ATC){6,20}  gbrod ; intersection 1 $
sequence (ATG){6,20}  gbrod ; intersection 1 $
sequence (ATT){6,20}  gbrod ; intersection 1 $
sequence (CAA){6,20}  gbrod ; intersection 1 $
sequence (CAC){6,20}  gbrod ; intersection 1 $
sequence (CAG){6,20}  gbrod ; intersection 1 $
sequence (CAT){6,20}  gbrod ; intersection 1 $
sequence (CCA){6,20}  gbrod ; intersection 1 $
sequence (CCC){6,20}  gbrod ; intersection 1 $
sequence (CCG){6,20}  gbrod ; intersection 1 $
sequence (CCT){6,20}  gbrod ; intersection 1 $
sequence (CGA){6,20}  gbrod ; intersection 1 $
sequence (CGC){6,20}  gbrod ; intersection 1 $
sequence (CGG){6,20}  gbrod ; intersection 1 $
sequence (CGT){6,20}  gbrod ; intersection 1 $
sequence (CTA){6,20}  gbrod ; intersection 1 $
sequence (CTC){6,20}  gbrod ; intersection 1 $
sequence (CTG){6,20}  gbrod ; intersection 1 $
sequence (CTT){6,20}  gbrod ; intersection 1 $
sequence (GAA){6,20}  gbrod ; intersection 1 $
sequence (GAC){6,20}  gbrod ; intersection 1 $
sequence (GAG){6,20}  gbrod ; intersection 1 $
sequence (GAT){6,20}  gbrod ; intersection 1 $
sequence (GCA){6,20}  gbrod ; intersection 1 $
sequence (GCC){6,20}  gbrod ; intersection 1 $
sequence (GCG){6,20}  gbrod ; intersection 1 $
sequence (GCT){6,20}  gbrod ; intersection 1 $
sequence (GGA){6,20}  gbrod ; intersection 1 $
sequence (GGC){6,20}  gbrod ; intersection 1 $
sequence (GGG){6,20}  gbrod ; intersection 1 $
sequence (GGT){6,20}  gbrod ; intersection 1 $
sequence (GTA){6,20}  gbrod ; intersection 1 $
sequence (GTC){6,20}  gbrod ; intersection 1 $
sequence (GTG){6,20}  gbrod ; intersection 1 $
sequence (GTT){6,20}  gbrod ; intersection 1 $
sequence (TAA){6,20}  gbrod ; intersection 1 $
sequence (TAC){6,20}  gbrod ; intersection 1 $
sequence (TAG){6,20}  gbrod ; intersection 1 $
sequence (TAT){6,20}  gbrod ; intersection 1 $
sequence (TCA){6,20}  gbrod ; intersection 1 $
sequence (TCC){6,20}  gbrod ; intersection 1 $
sequence (TCG){6,20}  gbrod ; intersection 1 $
sequence (TCT){6,20}  gbrod ; intersection 1 $
sequence (TGA){6,20}  gbrod ; intersection 1 $
sequence (TGC){6,20}  gbrod ; intersection 1 $
sequence (TGG){6,20}  gbrod ; intersection 1 $
sequence (TGT){6,20}  gbrod ; intersection 1 $
sequence (TTA){6,20}  gbrod ; intersection 1 $
sequence (TTC){6,20}  gbrod ; intersection 1 $
sequence (TTG){6,20}  gbrod ; intersection 1 $
sequence (TTT){6,20}  gbrod ; intersection 1 $
date
searches
#perl $Quiet=0; $Commandline=0; #remove first '#' to stay in interactive mode
#####Cut here: end of DNA WorkBench program ########
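
Rather than typing the 84 search lines by hand, the program above can be
generated.  The following Python sketch just prints text in the same form
as the listing; it knows nothing about DNA WorkBench itself, and the
function name and its parameters are my own.

```python
from itertools import product

def workbench_lines(library="gbrod", lo=6, hi=20, max_len=3):
    """Build the lines of the search program shown above."""
    lines = ["date", "text c-?dna %s" % library]
    for n in range(1, max_len + 1):
        for letters in product("ACGT", repeat=n):
            unit = "".join(letters)
            # Single bases are written bare, longer units in parentheses.
            pattern = unit if n == 1 else "(%s)" % unit
            lines.append("sequence %s{%d,%d}  %s ; intersection 1 $"
                         % (pattern, lo, hi, library))
    lines += ["date", "searches"]
    return lines

lines = workbench_lines()
print("\n".join(lines))
```

Changing lo, hi or the library name regenerates the whole program for a
different search.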



