Searching GenBank for nucleotide repeats
James Tisdall
tisdall at amalthea.humgen.upenn.edu
Thu Mar 3 08:46:45 EST 1994
In article <keith-010394135254 at mac08.biochem.ualberta.ca> keith at bones.biochem.ualberta.ca (Keith Robinson) writes:
>A staff member here wants to search all of the rodent cDNA sequences in
>GenBank to get an idea of the frequency of occurrence of:
>
> - Single base repeats (e.g. GGGGGG) of lengths 6 to 20 inclusive
> for all 4 possible combinations
>
> - double-base repeats (e.g. GAGAGAGAGAGA) of length 6 to 20, for
> all 16 possible combinations
>
> - triple-base repeats (e.g. GATGATGATGATGATGAT) of length 6 to 20,
> for all 64 possible combinations
>
>We use GCG here, and it is possible to perform this search with GCG's
>"findpatterns" command (e.g. searching for G repeats can be done with
>the pattern ~GG{6,20}~G), but this is time consuming (human and computer),
>and processing the resulting output file is rather tedious. Before
>attempting to write our own program, does anyone know of any software
>which would make setting up and interpreting results of these searches
>easier?
>
> Keith Robinson Dept. of Biochemistry
> The University of Alberta Edmonton, Alberta Canada

This was relatively straightforward using "DNA WorkBench", available by
ftp from cbil.humgen.upenn.edu:/pub/dnaworkbench (I used the Unix version).

I've appended the output of the program, followed by the program itself.
I put the program in a file called "repeats" and executed the command

dnaworkbench -q -f repeats

The program prints the date at the start and at the end; going by those
timestamps, the run took about an hour and 40 minutes.

If the last line of the program is uncommented, by removing the
leading "#" character, the program runs and leaves you in interactive
mode, with all the search results available for further processing.

(By "length 6 to 20" I assume you mean the number of copies of the
repeated group, not the length in nucleotides. For example, the
trinucleotide acg repeated 6 to 20 times would not include "acgacgacg",
which is 3 copies and 9 nucleotides long.
If I am assuming incorrectly, and you wanted each matched string to be
between 6 and 20 nucleotides long, just edit the program so that the
dinucleotides are in the form e.g. (AT){3,10} and the trinucleotides
are in the form e.g. (ATG){2,6}, since seven copies of a trinucleotide
would already be 21 nucleotides.)
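
To make the two readings of "length 6 to 20" concrete, here is a minimal Python sketch (not part of DNA WorkBench; the function names copies_pattern and length_pattern are made up for illustration). It builds an ordinary regular expression for each reading. Under the nucleotide-length reading the bounds are ceil(6/k) and floor(20/k) for a unit of k bases, so with a strict ceiling of 20 nucleotides the trinucleotide upper bound comes out as 6 copies (18 nucleotides).

```python
import re

def copies_pattern(unit, lo=6, hi=20):
    """Regex for `unit` repeated in tandem lo..hi times (copy-count reading)."""
    return re.compile("(?:%s){%d,%d}" % (unit, lo, hi))

def length_pattern(unit, min_len=6, max_len=20):
    """Regex for tandem copies of `unit` spanning min_len..max_len nucleotides."""
    k = len(unit)
    lo = -(-min_len // k)   # ceiling division: fewest copies reaching min_len
    hi = max_len // k       # floor division: most copies fitting in max_len
    return re.compile("(?:%s){%d,%d}" % (unit, lo, hi))

nine = "ACG" * 3  # "ACGACGACG", 3 copies, 9 nucleotides

print(bool(copies_pattern("ACG").search(nine)))  # False: needs at least 6 copies
print(bool(length_pattern("ACG").search(nine)))  # True: 9 nt lies within 6..20
print(length_pattern("AT").pattern)              # (?:AT){3,10}
print(length_pattern("ATG").pattern)             # (?:ATG){2,6}
```

Because the search is unanchored, the upper bound only limits how much of a long run a single match covers; a run longer than the bound still produces a hit.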

I extract the GenBank cDNA sequences by simply searching for the
string cDNA (or c-DNA) over the gbrod.seq library; then, for each
set of hits on a given repeat sequence, I take the intersection of
those hits with the cDNA set. This simple method of extracting the cDNA
sequences no doubt includes some false positives, and may miss
some cDNAs as well, but to "get an idea of the frequency of occurrence"
of these repeats in rodent cDNA, I expect it's fairly accurate.
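
The text-search-then-intersect approach just described can be sketched in Python on toy data. This is only an illustration under stated assumptions: the three records are invented, the helper names text_hits and sequence_hits are hypothetical, and case-insensitive matching is my assumption about how the text search behaves. Real GenBank entries would first have to be parsed out of gbrod.seq.

```python
import re

# Hypothetical toy records: id -> (annotation text, sequence).
records = {
    "R1": ("Rat mRNA, complete cDNA clone", "TTGC" + "A" * 8 + "GGCT"),
    "R2": ("Mouse genomic DNA, exon 2", "CC" + "A" * 7 + "TT"),
    "R3": ("Mouse c-DNA for actin", "ACGTACGTACGT"),
}

def text_hits(pattern, recs):
    """Ids of records whose annotation text matches the pattern."""
    rx = re.compile(pattern, re.IGNORECASE)  # case-insensitivity is an assumption
    return {rid for rid, (text, _) in recs.items() if rx.search(text)}

def sequence_hits(pattern, recs):
    """Ids of records whose sequence matches the pattern."""
    rx = re.compile(pattern, re.IGNORECASE)
    return {rid for rid, (_, seq) in recs.items() if rx.search(seq)}

cdna = text_hits(r"c-?dna", records)          # analogue of: text c-?dna gbrod
a_runs = sequence_hits(r"A{6,20}", records)   # analogue of: sequence A{6,20} gbrod

print(sorted(cdna))           # ['R1', 'R3']
print(sorted(a_runs))         # ['R1', 'R2']
print(sorted(cdna & a_runs))  # ['R1'], the analogue of: intersection 1 $
```

Note that the counts are counts of entries, not of repeat occurrences: an entry with several qualifying runs is still a single hit.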

It is often unnecessary to search for all of, e.g., the 64 trinucleotide
repeats. For instance, AAA is already a repeat of A, and will be found
by searching for repeats of A. Also, ACT, CTA and TAC are "rotations"
of each other, so you only really have to look for one of them. Following
are the "interesting" repeats up to length 3 (send me mail for a program
that calculates these, up to length 8). Note that there are only 30 of
these, as opposed to 4 + 16 + 64 = 84 strings in all. Of course, it
depends on how exact you're trying to be: you may find some sequences
that contain, e.g., 6 copies of AAG in a row but not 6 copies of GAA.
Also, it is sometimes only necessary to search for just one
of a pair of complementary fragments such as ACT and TGA.

### Table of non-periodic strings over {a,c,g,t} modulo rotations, up to
### length 3.
A
C
G
T
AC
AG
AT
CG
CT
GT
AAC
AAG
AAT
ACC
ACG
ACT
AGC
AGG
AGT
ATC
ATG
ATT
CCG
CCT
CGG
CGT
CTG
CTT
GGT
GTT
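
For anyone who wants to regenerate a table like the one above, here is a minimal brute-force sketch in Python. It is not the program mentioned earlier (that one is available by mail); the function names are mine. It keeps a string if all of its rotations are distinct (so it is not a shorter unit repeated) and if it is the alphabetically smallest of its rotations (so each rotation class is listed once).

```python
from itertools import product

def rotations(s):
    """All cyclic rotations of s, e.g. ACT -> ACT, CTA, TAC."""
    return [s[i:] + s[:i] for i in range(len(s))]

def interesting_repeats(max_len, alphabet="ACGT"):
    """Non-periodic strings up to max_len, one representative per rotation class."""
    out = []
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            s = "".join(letters)
            rots = rotations(s)
            # Non-periodic: all n rotations differ. Representative: smallest rotation.
            if len(set(rots)) == n and s == min(rots):
                out.append(s)
    return out

units = interesting_repeats(3)
print(len(units))   # 30
print(units[:10])   # ['A', 'C', 'G', 'T', 'AC', 'AG', 'AT', 'CG', 'CT', 'GT']
```

Enumerating every string is fine at these sizes; even for length 8 there are only 4^8 = 65536 candidates at the longest length.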
Jim
======================================================================
James Tisdall
Departments of Genetics and Computer and Information Science
Computational Biology and Informatics Laboratory, Human Genome Project
University of Pennsylvania
tisdall at cbil.humgen.upenn.edu 215-573-3113 fax 215-573-3111
Biocomputing Associates
(610) 933-9266
======================================================================
#####Cut here: start of program output########
Wed Mar 2 19:11:55 EST 1994
Wed Mar 2 20:53:37 EST 1994
"HELP SEARCHES" for instructions to LOAD, SAVE, WRITE, and RETRIEVE
To combine searches use "HELP" for INTERSECTION UNION DIFFERENCE
SEARCHES
Size Number Command
_____________________________________
8158 1 text c-?dna (in gbrod)
5268 2 sequence A{6,20} (in gbrod)
2368 3 intersect 1 $
2818 4 sequence C{6,20} (in gbrod)
1257 5 intersect 1 $
3027 6 sequence G{6,20} (in gbrod)
1232 7 intersect 1 $
4365 8 sequence T{6,20} (in gbrod)
1789 9 intersect 1 $
1162 10 sequence (AA){6,20} (in gbrod)
562 11 intersect 1 $
519 12 sequence (AC){6,20} (in gbrod)
114 13 intersect 1 $
248 14 sequence (AG){6,20} (in gbrod)
66 15 intersect 1 $
191 16 sequence (AT){6,20} (in gbrod)
61 17 intersect 1 $
525 18 sequence (CA){6,20} (in gbrod)
117 19 intersect 1 $
141 20 sequence (CC){6,20} (in gbrod)
37 21 intersect 1 $
17 22 sequence (CG){6,20} (in gbrod)
2 23 intersect 1 $
244 24 sequence (CT){6,20} (in gbrod)
65 25 intersect 1 $
247 26 sequence (GA){6,20} (in gbrod)
57 27 intersect 1 $
18 28 sequence (GC){6,20} (in gbrod)
3 29 intersect 1 $
121 30 sequence (GG){6,20} (in gbrod)
30 31 intersect 1 $
535 32 sequence (GT){6,20} (in gbrod)
106 33 intersect 1 $
196 34 sequence (TA){6,20} (in gbrod)
62 35 intersect 1 $
248 36 sequence (TC){6,20} (in gbrod)
65 37 intersect 1 $
546 38 sequence (TG){6,20} (in gbrod)
117 39 intersect 1 $
663 40 sequence (TT){6,20} (in gbrod)
239 41 intersect 1 $
483 42 sequence (AAA){6,20} (in gbrod)
240 43 intersect 1 $
14 44 sequence (AAC){6,20} (in gbrod)
4 45 intersect 1 $
24 46 sequence (AAG){6,20} (in gbrod)
16 47 intersect 1 $
15 48 sequence (AAT){6,20} (in gbrod)
4 49 intersect 1 $
14 50 sequence (ACA){6,20} (in gbrod)
4 51 intersect 1 $
32 52 sequence (ACC){6,20} (in gbrod)
15 53 intersect 1 $
1 54 sequence (ACG){6,20} (in gbrod)
0 55 intersect 1 $
2 56 sequence (ACT){6,20} (in gbrod)
1 57 intersect 1 $
24 58 sequence (AGA){6,20} (in gbrod)
16 59 intersect 1 $
57 60 sequence (AGC){6,20} (in gbrod)
33 61 intersect 1 $
49 62 sequence (AGG){6,20} (in gbrod)
18 63 intersect 1 $
2 64 sequence (AGT){6,20} (in gbrod)
1 65 intersect 1 $
15 66 sequence (ATA){6,20} (in gbrod)
4 67 intersect 1 $
8 68 sequence (ATC){6,20} (in gbrod)
0 69 intersect 1 $
6 70 sequence (ATG){6,20} (in gbrod)
2 71 intersect 1 $
19 72 sequence (ATT){6,20} (in gbrod)
4 73 intersect 1 $
14 74 sequence (CAA){6,20} (in gbrod)
4 75 intersect 1 $
33 76 sequence (CAC){6,20} (in gbrod)
15 77 intersect 1 $
55 78 sequence (CAG){6,20} (in gbrod)
30 79 intersect 1 $
9 80 sequence (CAT){6,20} (in gbrod)
1 81 intersect 1 $
28 82 sequence (CCA){6,20} (in gbrod)
12 83 intersect 1 $
15 84 sequence (CCC){6,20} (in gbrod)
3 85 intersect 1 $
13 86 sequence (CCG){6,20} (in gbrod)
8 87 intersect 1 $
26 88 sequence (CCT){6,20} (in gbrod)
5 89 intersect 1 $
0 90 (CGA){6,20} (in gbrod)
0 91 intersect 1 $
18 92 sequence (CGC){6,20} (in gbrod)
9 93 intersect 1 $
14 94 sequence (CGG){6,20} (in gbrod)
7 95 intersect 1 $
0 96 (CGT){6,20} (in gbrod)
0 97 intersect 1 $
2 98 sequence (CTA){6,20} (in gbrod)
1 99 intersect 1 $
24 100 sequence (CTC){6,20} (in gbrod)
4 101 intersect 1 $
44 102 sequence (CTG){6,20} (in gbrod)
15 103 intersect 1 $
15 104 sequence (CTT){6,20} (in gbrod)
4 105 intersect 1 $
24 106 sequence (GAA){6,20} (in gbrod)
16 107 intersect 1 $
2 108 sequence (GAC){6,20} (in gbrod)
0 109 intersect 1 $
50 110 sequence (GAG){6,20} (in gbrod)
20 111 intersect 1 $
10 112 sequence (GAT){6,20} (in gbrod)
6 113 intersect 1 $
56 114 sequence (GCA){6,20} (in gbrod)
30 115 intersect 1 $
19 116 sequence (GCC){6,20} (in gbrod)
10 117 intersect 1 $
15 118 sequence (GCG){6,20} (in gbrod)
7 119 intersect 1 $
38 120 sequence (GCT){6,20} (in gbrod)
13 121 intersect 1 $
51 122 sequence (GGA){6,20} (in gbrod)
18 123 intersect 1 $
20 124 sequence (GGC){6,20} (in gbrod)
10 125 intersect 1 $
13 126 sequence (GGG){6,20} (in gbrod)
4 127 intersect 1 $
12 128 sequence (GGT){6,20} (in gbrod)
1 129 intersect 1 $
3 130 sequence (GTA){6,20} (in gbrod)
1 131 intersect 1 $
0 132 (GTC){6,20} (in gbrod)
0 133 intersect 1 $
11 134 sequence (GTG){6,20} (in gbrod)
1 135 intersect 1 $
25 136 sequence (GTT){6,20} (in gbrod)
10 137 intersect 1 $
15 138 sequence (TAA){6,20} (in gbrod)
4 139 intersect 1 $
2 140 sequence (TAC){6,20} (in gbrod)
1 141 intersect 1 $
2 142 sequence (TAG){6,20} (in gbrod)
1 143 intersect 1 $
19 144 sequence (TAT){6,20} (in gbrod)
4 145 intersect 1 $
9 146 sequence (TCA){6,20} (in gbrod)
1 147 intersect 1 $
26 148 sequence (TCC){6,20} (in gbrod)
5 149 intersect 1 $
1 150 sequence (TCG){6,20} (in gbrod)
0 151 intersect 1 $
15 152 sequence (TCT){6,20} (in gbrod)
4 153 intersect 1 $
9 154 sequence (TGA){6,20} (in gbrod)
5 155 intersect 1 $
38 156 sequence (TGC){6,20} (in gbrod)
12 157 intersect 1 $
12 158 sequence (TGG){6,20} (in gbrod)
2 159 intersect 1 $
25 160 sequence (TGT){6,20} (in gbrod)
10 161 intersect 1 $
19 162 sequence (TTA){6,20} (in gbrod)
4 163 intersect 1 $
15 164 sequence (TTC){6,20} (in gbrod)
4 165 intersect 1 $
26 166 sequence (TTG){6,20} (in gbrod)
9 167 intersect 1 $
189 168 sequence (TTT){6,20} (in gbrod)
49 169 intersect 1 $
#####Cut here: end of program output########
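
Since processing the output was one of the original concerns, here is a rough Python sketch for turning a listing like the one above into (pattern, total hits, cDNA hits) rows. It assumes whitespace-separated columns and that every sequence search is followed directly by its "intersect 1 $" line; the regular expressions and the name pair_counts are mine, not part of DNA WorkBench. The word "sequence" is treated as optional because a few zero-hit lines in the listing appear without it.

```python
import re

# Size, search number, optional "sequence", then a pattern such as (CGA){6,20}.
SEARCH = re.compile(r"^\s*(\d+)\s+(\d+)\s+(?:sequence\s+)?(\S+\{\d+,\d+\})\s+\(in \w+\)")
# Size and search number of an intersection with search 1 (the cDNA text search).
INTERSECT = re.compile(r"^\s*(\d+)\s+(\d+)\s+intersect\s+1\s+\$")

def pair_counts(lines):
    """Pair each sequence search with the intersection line that follows it."""
    rows, pending = [], None
    for line in lines:
        m = SEARCH.match(line)
        if m:
            pending = (m.group(3), int(m.group(1)))
            continue
        m = INTERSECT.match(line)
        if m and pending:
            rows.append((pending[0], pending[1], int(m.group(1))))
            pending = None
    return rows

sample = """\
8158 1 text c-?dna (in gbrod)
5268 2 sequence A{6,20} (in gbrod)
2368 3 intersect 1 $
0 90 (CGA){6,20} (in gbrod)
0 91 intersect 1 $
"""

for pattern, total, cdna in pair_counts(sample.splitlines()):
    print(pattern, total, cdna)
# A{6,20} 5268 2368
# (CGA){6,20} 0 0
```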
#####Cut here: start of DNA WorkBench program ########
date
text c-?dna gbrod
sequence A{6,20} gbrod ; intersection 1 $
sequence C{6,20} gbrod ; intersection 1 $
sequence G{6,20} gbrod ; intersection 1 $
sequence T{6,20} gbrod ; intersection 1 $
sequence (AA){6,20} gbrod ; intersection 1 $
sequence (AC){6,20} gbrod ; intersection 1 $
sequence (AG){6,20} gbrod ; intersection 1 $
sequence (AT){6,20} gbrod ; intersection 1 $
sequence (CA){6,20} gbrod ; intersection 1 $
sequence (CC){6,20} gbrod ; intersection 1 $
sequence (CG){6,20} gbrod ; intersection 1 $
sequence (CT){6,20} gbrod ; intersection 1 $
sequence (GA){6,20} gbrod ; intersection 1 $
sequence (GC){6,20} gbrod ; intersection 1 $
sequence (GG){6,20} gbrod ; intersection 1 $
sequence (GT){6,20} gbrod ; intersection 1 $
sequence (TA){6,20} gbrod ; intersection 1 $
sequence (TC){6,20} gbrod ; intersection 1 $
sequence (TG){6,20} gbrod ; intersection 1 $
sequence (TT){6,20} gbrod ; intersection 1 $
sequence (AAA){6,20} gbrod ; intersection 1 $
sequence (AAC){6,20} gbrod ; intersection 1 $
sequence (AAG){6,20} gbrod ; intersection 1 $
sequence (AAT){6,20} gbrod ; intersection 1 $
sequence (ACA){6,20} gbrod ; intersection 1 $
sequence (ACC){6,20} gbrod ; intersection 1 $
sequence (ACG){6,20} gbrod ; intersection 1 $
sequence (ACT){6,20} gbrod ; intersection 1 $
sequence (AGA){6,20} gbrod ; intersection 1 $
sequence (AGC){6,20} gbrod ; intersection 1 $
sequence (AGG){6,20} gbrod ; intersection 1 $
sequence (AGT){6,20} gbrod ; intersection 1 $
sequence (ATA){6,20} gbrod ; intersection 1 $
sequence (ATC){6,20} gbrod ; intersection 1 $
sequence (ATG){6,20} gbrod ; intersection 1 $
sequence (ATT){6,20} gbrod ; intersection 1 $
sequence (CAA){6,20} gbrod ; intersection 1 $
sequence (CAC){6,20} gbrod ; intersection 1 $
sequence (CAG){6,20} gbrod ; intersection 1 $
sequence (CAT){6,20} gbrod ; intersection 1 $
sequence (CCA){6,20} gbrod ; intersection 1 $
sequence (CCC){6,20} gbrod ; intersection 1 $
sequence (CCG){6,20} gbrod ; intersection 1 $
sequence (CCT){6,20} gbrod ; intersection 1 $
sequence (CGA){6,20} gbrod ; intersection 1 $
sequence (CGC){6,20} gbrod ; intersection 1 $
sequence (CGG){6,20} gbrod ; intersection 1 $
sequence (CGT){6,20} gbrod ; intersection 1 $
sequence (CTA){6,20} gbrod ; intersection 1 $
sequence (CTC){6,20} gbrod ; intersection 1 $
sequence (CTG){6,20} gbrod ; intersection 1 $
sequence (CTT){6,20} gbrod ; intersection 1 $
sequence (GAA){6,20} gbrod ; intersection 1 $
sequence (GAC){6,20} gbrod ; intersection 1 $
sequence (GAG){6,20} gbrod ; intersection 1 $
sequence (GAT){6,20} gbrod ; intersection 1 $
sequence (GCA){6,20} gbrod ; intersection 1 $
sequence (GCC){6,20} gbrod ; intersection 1 $
sequence (GCG){6,20} gbrod ; intersection 1 $
sequence (GCT){6,20} gbrod ; intersection 1 $
sequence (GGA){6,20} gbrod ; intersection 1 $
sequence (GGC){6,20} gbrod ; intersection 1 $
sequence (GGG){6,20} gbrod ; intersection 1 $
sequence (GGT){6,20} gbrod ; intersection 1 $
sequence (GTA){6,20} gbrod ; intersection 1 $
sequence (GTC){6,20} gbrod ; intersection 1 $
sequence (GTG){6,20} gbrod ; intersection 1 $
sequence (GTT){6,20} gbrod ; intersection 1 $
sequence (TAA){6,20} gbrod ; intersection 1 $
sequence (TAC){6,20} gbrod ; intersection 1 $
sequence (TAG){6,20} gbrod ; intersection 1 $
sequence (TAT){6,20} gbrod ; intersection 1 $
sequence (TCA){6,20} gbrod ; intersection 1 $
sequence (TCC){6,20} gbrod ; intersection 1 $
sequence (TCG){6,20} gbrod ; intersection 1 $
sequence (TCT){6,20} gbrod ; intersection 1 $
sequence (TGA){6,20} gbrod ; intersection 1 $
sequence (TGC){6,20} gbrod ; intersection 1 $
sequence (TGG){6,20} gbrod ; intersection 1 $
sequence (TGT){6,20} gbrod ; intersection 1 $
sequence (TTA){6,20} gbrod ; intersection 1 $
sequence (TTC){6,20} gbrod ; intersection 1 $
sequence (TTG){6,20} gbrod ; intersection 1 $
sequence (TTT){6,20} gbrod ; intersection 1 $
date
searches
#perl $Quiet=0; $Commandline=0; #remove first '#' to stay in interactive mode
#####Cut here: end of DNA WorkBench program ########
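
The program above is regular enough that it could be generated rather than typed. The Python sketch below reproduces the same lines using only the command syntax that appears in the listing (text, sequence, intersection, date, searches); the function name workbench_script and its parameters are hypothetical, and I have not checked the generated file against DNA WorkBench itself.

```python
from itertools import product

def workbench_script(library="gbrod", lo=6, hi=20, max_unit=3):
    """Build the repeat-search script text for all units of 1..max_unit bases."""
    lines = ["date", "text c-?dna %s" % library]
    for n in range(1, max_unit + 1):
        for letters in product("ACGT", repeat=n):
            unit = "".join(letters)
            # Single bases are written bare (A{6,20}); longer units are grouped.
            pat = unit if n == 1 else "(%s)" % unit
            lines.append("sequence %s{%d,%d} %s ; intersection 1 $"
                         % (pat, lo, hi, library))
    lines += ["date", "searches"]
    return "\n".join(lines) + "\n"

script = workbench_script()
print(script.splitlines()[2])    # sequence A{6,20} gbrod ; intersection 1 $
print(len(script.splitlines()))  # 88
```

Writing the returned string to a file called "repeats" would give the same 88 command lines as the listing, minus the commented-out perl line.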
