ANNOUNCE: tacg - Restriction Enzyme analysis tool for unix

Harry Mangalam mangalam at uci.edu
Fri Mar 8 01:24:55 EST 1996


                                  tacg 
            a program for the restriction enzyme analysis of DNA
                              Release 1.33

                        by Harry Mangalam, UC Irvine 
                      (mangalam at uci.edu, 714 824 4824)
         
# This posting is to announce the availability of 'tacg', a command line tool 
for the restriction enzyme analysis of DNA for unix-like operating systems.  
Binaries currently exist for IRIX (5.3), SunOS (5.3), OSF/1 (V3.0/347), and 
Linux (1.2.8); others will be made available as I find systems on which to 
compile them, or as others contribute binaries.

# For the impatient, here's an example of how to use it:

tail +44 seq.file | tacg -n 6 -o 5 -F 2 -l ladder.map -w 90 >seq.file.map

Translation: skip the first 43 lines of seq.file (tail +44 starts printing 
at line 44) and pipe the remaining sequence to tacg, returning info on all 
6+ cutters (-n 6) that generate 5' overlaps (-o 5), giving me the sorted 
fragment sizes of those enzymes that match (-F 2) and a ladder map 
(-l ladder.map), along with the default linear restriction map with 
1-letter, 1-frame translation, and write the output 90 characters wide 
(-w 90) to a file called seq.file.map.

# If you're interested in using it, you can get it via anonymous ftp at:

ftp://mamba.bio.uci.edu/pub/tacg

# The source code is freely available for instructional and nonprofit 
purposes, although since it is presently in beta release, I would suggest 
that anyone contemplating incorporating it wait for the next release while 
more bugs are shaken out.  Assuming it's used a fair bit, I'd like to have 
a chance to change it based on responses, document it more extensively and 
neaten it up before general release.

# This is citation-ware.  If you use it, please allow it to spit back about 
100 bytes of data so I can analyze its use and spread.  You can check the 
source code (especially udping.xx.c) to see what it does, and if it still 
makes you uneasy, you can disable it from the command line or by 
recompiling.

# The design criteria were: 

1) Simplicity. 
   It requires only 3 files - the executable, the restriction enzyme database 
file (rebase.data), and the codon usage file (codon.prefs).  The 2 data 
files are ascii text and can be edited and modified by the user, if 
required.
   It was designed along the same lines as other small unix utilities - a tool 
that does a small set of things, does them reasonably well and can be 
chained to other utilities or used in conjunction with them to extract the 
information you need without too much fuss.
   The output of this program uses only alphanumeric characters so that all of 
its output can be viewed on a vanilla vt100-like terminal, although you can 
do more useful things if you're using an X display.  For instance, some of 
the output can best be viewed using very small fonts or in multiple columns 
on a page, generated by feeding the output to a postscript conversion 
package (lptops, enscript, nenscript, genscript, etc).
      
2) High Portability 
   The program is written in vanilla ANSI C, with no arcane ifdefs.  It 
compiles with few complaints on SGI's IRIX (5.3) with cc, Sparcs running 
SunOS (5.3) with cc and gcc, DEC Alphas running OSF1 (v3.0) with cc, and 
*especially* Linux (1.2.8) with gcc.  

3) Speed and Capacity
   The program uses a hashtable-lookup of the restriction enzyme recognition 
sites (generated on the fly) so that only about half of the sequence is 
checked any further than the initial hash.  Depending on what kind of 
output you request and the i/o of the machine (output is by far the most 
time-consuming part of the program), the program processes:

Speed*               Hardware             OS             Compiler, flags
 ~14-150Kb/s         i486/66/ISA          Linux 1.2.8    gcc -O2
 ~16-80Kb/s          Sparc 4/?MHz         SunOS 5.3      gcc -O
 ~25-130Kb/s         early DEC Alpha      OSF/1          gcc -O2
 ~23-260Kb/s         R4000/100 Indigo2    IRIX 5.3       cc -O2 -mips2
 ~94-700Kb/s         R4400/200 Indigo2    IRIX 5.3       cc -O2 -mips2

   It also uses dynamic memory allocation so that while there are a few 
hard-coded limitations (in output format), it easily handles sequences into 
the millions of bases.  
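
   The hash-then-verify idea can be sketched roughly as follows.  This is a 
minimal Python illustration, not tacg's actual C code: the 4-base key 
length, the names build_table and find_sites, and the tiny enzyme table are 
all assumptions made for the example, and degenerate bases are ignored.

```python
from collections import defaultdict

# Illustrative subset of recognition sites (non-degenerate only).
SITES = {"EcoRI": "gaattc", "EcoRV": "gatatc", "MspI": "ccgg", "HpaII": "ccgg"}
KEY_LEN = 4  # assumed hash key length: the first 4 bases of each site


def build_table(sites):
    """Map each KEY_LEN-base prefix to the enzymes whose site starts with it."""
    table = defaultdict(list)
    for name, site in sites.items():
        table[site[:KEY_LEN]].append((name, site))
    return table


def find_sites(seq, table):
    """Return (enzyme, 1-based start) for every recognition site in seq."""
    hits = []
    for i in range(len(seq) - KEY_LEN + 1):
        # Cheap lookup first; most windows miss the table and are skipped.
        for name, site in table.get(seq[i:i + KEY_LEN], ()):
            # Only windows that pass the lookup are compared in full.
            if seq.startswith(site, i):
                hits.append((name, i + 1))
    return hits


seq = "agcggtccggctgtcgcggatgaatatgaccagccaacgtccgatatcacgaaggataaa"
print(find_sites(seq, build_table(SITES)))
```

   In this 60-base window, "gaat" passes the lookup for EcoRI but fails the 
full comparison, which is the case the initial hash is meant to make cheap.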

4) Usability
   Inspired by Christian Marck's elegant DNA Strider, I used a similar output 
format, changing a few things I didn't like, adding a few things I wanted.  

The Feature Set:
a) produces linear restriction maps. 

   The map shows the EXACT cutting position (not just the start of the 
recognition sequence - a minor nitpick with Strider and other programs), 
with same-page translation (ditto) in 1/3/6 frames in 1 or 3 letter codes.  
Tested on sequences of more than a million bases.  For example:

============================================================================
                MspI                                                            
                HpaII                                                           
             Sau96I                                                             
             AvaII                                                              
             RsrII       BstUI          FokI  MaeII   EcoRV                     
             \  \        \              \     \       \                         
  13981   agcggtccggctgtcgcggatgaatatgaccagccaacgtccgatatcacgaaggataaa  14040   
          tcgccaggccgacagcgcctacttatactggtcggttgcaggctatagtgcttcctattt          
              ^    *    ^    *    ^    *    ^    *    ^    *    ^    *          
          S  G  P  A  V  A  D  E  Y  D  Q  P  T  S  D  I  T  K  D  K            
============================================================================
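
   The 1-letter, 1-frame translation in the map above can be reproduced with 
a short sketch.  It assumes the standard genetic code (tacg itself reads its 
codon data from codon.prefs, whose format is not described here); the name 
translate and the "X" placeholder for unknown codons are illustrative.

```python
# Standard genetic code, codons enumerated in t, c, a, g order.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq, frame=0):
    """Translate seq in one forward frame (0, 1 or 2) to 1-letter codes."""
    seq = seq.lower()
    return "".join(
        CODONS.get(seq[i:i + 3], "X")  # "X" marks codons with non-acgt bases
        for i in range(frame, len(seq) - 2, 3)
    )


top = "agcggtccggctgtcgcggatgaatatgaccagccaacgtccgatatcacgaaggataaa"
print(" ".join(translate(top)))
```

   Run on the top strand shown above, this gives the same residues as the 
translation line in the sample map.
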
b) filters enzymes inclusively by:
   - magnitude of recognition sequence (tgca=4, tgyrca=5, tgcnnngca=6, etc)
   - overlap of resulting ends (5', 3', blunt)
   - minimum, maximum times they cut the sequence
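
   One way to arrive at the magnitudes quoted above is to let each base 
contribute less as it becomes more degenerate.  The sketch below is an 
assumption that happens to agree with the three examples given (tgca=4, 
tgyrca=5, tgcnnngca=6); it is not taken from the tacg source, and the names 
magnitude and passes are illustrative.

```python
import math

# Number of real bases each IUPAC code can stand for.
DEGENERACY = {
    "a": 1, "c": 1, "g": 1, "t": 1,
    "r": 2, "y": 2, "m": 2, "k": 2, "s": 2, "w": 2,
    "b": 3, "d": 3, "h": 3, "v": 3,
    "n": 4,
}


def magnitude(site):
    """Effective length: 1 per exact base, 0.5 per 2-fold code, 0 per n."""
    return sum(1 - math.log(DEGENERACY[base], 4) for base in site.lower())


def passes(site, n):
    """True if the site is an 'n or more' cutter (small float tolerance)."""
    return magnitude(site) >= n - 1e-9


for site in ("tgca", "tgyrca", "tgcnnngca"):
    print(site, round(magnitude(site), 2))
```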

c) handles linear/circular topologies, subsequences

d) produces Summaries of cuts:
============================================================================
 Restriction Enzymes that DO NOT CUT in this sequence:

      BbeI      EheI      FseI      KasI      NarI      NheI      NotI
      PacI    PaeR7I      SalI      SfiI      SpeI      SwaI      XhoI

 Total Number of Cuts per Restriction Enzyme:

     AatII    5     BsiYI  130     EcoNI    5      MluI    7      SalI    0
      AccI    5      BsmI   30  EcoO109I    2      MmeI    8      SapI    7
     AflII    2     BsmAI   26     EcoRI    3      MnlI  184      SauI    1
    AflIII   13   Bsp120I    1    EcoRII   49      MscI   17    Sau96I   61
      AgeI   12  Bsp1286I   26     EcoRV   14      MseI  106      ScaI    4
      AluI   89     BspEI   22      EheI    0      MspI  278     ScrFI  145
                              <etc>
============================================================================

   - Tables of cutting sites, e.g.:
      (for enzymes that pass the filtering options)
============================================================================
  **  Cut Sites by Restriction Enzyme **

AatII       G_ACGT'C - 5 cut(s)
   5110   9399  11248  14979  29041

AccI        GT'mk_AC - 5 cut(s)
   2192  15262  18836  19475  31303

AflII       C'TTAA_G - 2 cut(s)
   6541  12619

AflIII      A'CryG_T - 13 cut(s)
    459    629   5549  11282  15373  17792  18285  19997  20953  22221  24134
  24169  26529
============================================================================  
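
   The site notation in the table above can be unpacked with a small 
sketch.  It assumes that ' marks the top-strand cut and _ the bottom-strand 
cut, which matches the known behaviour of the enzymes shown (AatII leaves a 
3' overhang; AccI and AflII leave 5' overhangs).  How tacg writes blunt 
cutters is not shown in this post, so the handling of a single mark is a 
guess, and the reported position (the last base before the cut) may not be 
tacg's own convention.  All function names are illustrative.

```python
def parse_site(pattern):
    """Split e.g. "C'TTAA_G" into (site, top_cut, bottom_cut) offsets."""
    site = ""
    top = bottom = None
    for ch in pattern:
        if ch == "'":
            top = len(site)      # bases to the left of the top-strand cut
        elif ch == "_":
            bottom = len(site)   # bases to the left of the bottom-strand cut
        else:
            site += ch
    if top is None and bottom is None:
        raise ValueError("no cut mark in " + pattern)
    # Assumption: a single mark means both strands are cut at the same place.
    if top is None:
        top = bottom
    if bottom is None:
        bottom = top
    return site, top, bottom


def overlap(pattern):
    """Classify the ends a site leaves: "5'", "3'" or "blunt"."""
    _, top, bottom = parse_site(pattern)
    if top < bottom:
        return "5'"
    if top > bottom:
        return "3'"
    return "blunt"


def cut_position(match_start, pattern):
    """1-based index of the last base before the top-strand cut."""
    _, top, _ = parse_site(pattern)
    return match_start + top - 1


for name, pat in (("AatII", "G_ACGT'C"), ("AccI", "GT'mk_AC"), ("AflII", "C'TTAA_G")):
    print(name, parse_site(pat), overlap(pat))
```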

   - Tables of fragment sizes (unsorted, sorted or both), e.g.:
============================================================================
  **  SORTED Fragment Sizes by Restriction Enzyme **

AatII       G_ACGT'C - 5 Fragment(s)
   1849   3449   3731   4289   5110  14062

AccI        GT'mk_AC - 5 Fragment(s)
    639   1187   2192   3574  11828  13070

AflII       C'TTAA_G - 2 Fragment(s)
   6078   6541  19871

AflIII      A'CryG_T - 13 Fragment(s)
     35    170    459    493    956   1268   1712   1913   2360   2419   4091
   4920   5733   5961
============================================================================
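
   The fragment tables follow directly from the cut-site tables.  A minimal 
sketch, using the AatII cuts listed earlier: the sequence length of 32490 
is not stated anywhere in this post but is inferred from the sample tables 
(last cut plus the remaining fragment), and the function name is 
illustrative.

```python
def fragment_sizes(cuts, length, circular=False):
    """Fragment sizes produced by cutting a sequence at the given positions."""
    cuts = sorted(cuts)
    if not cuts:
        return [length]
    # Distances between neighbouring cuts.
    inner = [b - a for a, b in zip(cuts, cuts[1:])]
    first, last = cuts[0], length - cuts[-1]
    if circular:
        # On a circle the two end pieces are one fragment.
        return inner + [first + last]
    return [first] + inner + [last]


aatii = [5110, 9399, 11248, 14979, 29041]
print(sorted(fragment_sizes(aatii, 32490)))
print(sorted(fragment_sizes(aatii, 32490, circular=True)))
```

   Note that n cuts give n+1 fragments on a linear sequence and n on a 
circular one.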

   - Ladder map, with 5', 3', and blunt cutters indicated (\, /, |)
============================================================================
  Ladder Map of Restriction Enzyme Cut Sites:  

                    10000          20000          30000          40000     
                        :              :              :              :      
     AccI ---\-----\----------------------------------------------\--------
   AceIII ---------------------------\---------------\---------------\---\-
     AciI ----\--\\2-\2--\--2-\\-2\--\\-\\-\\-\--\-\-\----2\--\--\-\-\-\---
    AflII --\--\--\\\\-----------------\\\-----------\\------\-------------
                        :              :              :              :      
   AflIII --------------\-----------\------------\------\\\2---------------
     AhdI -------/---------------------------------------------------------
     AluI |3355323833353|43284|44-|45|324|54252-3426|2|22|52543|42323522|22
     AlwI -2\--\2---3----2-----------------\\-\22-2\--\\\\-\---2--\-\-----2
                        :              :              :              :      
============================================================================
   
   - A summary map (a la Strider) of enzymes that cut 2 times or fewer 
      (although this threshold may be changed to be length-sensitive)
============================================================================
Summary of Enzymes that cut ** 2 ** times or less:

        XhoI at 5733
      Pfl1108I at 4774              PvuI at 22499 PshAI at 29823
   DrdI at 2598                            NarI at 27578
      BssSI at 4941                 BsiEI at 22499            BssSI at 38186
       AhdI at 5113                        BsaHI at 27578     BsmBI at 37969
   |  |||                        |      |   |           |
---------------------------------------------------------------------------
              :              :              :              :              :
          10000          20000          30000          40000          50000
============================================================================

   - A pseudo-gel format that shows how different digests would look if run 
      on a gel.  Currently, it uses a straight log10() approximation, but a 
      suggestion was made to use an additional transformation to mimic 
      different percentages of agarose/polyacrylamide.  It uses the same 
      representation as the ladder map, with single fragments represented as 
      '|' and multiple fragments that cannot be resolved shown as a digit 
      giving how many map to that space.
============================================================================
  Pseudo-Gel Map of Digestions:    *Maximum* Cuts: 50

        100                                                     1000        
          .                .         .      .     .   .   .  .  .  .        
     AccI                                                                   
   AceIII                                                                   
     AciI |          |||               |    |      | 2|||   2|   2 |    ||| 
    AflII                                         ||   |  |  | |            
          .                .         .      .     .   .   .  .  .  .        
   AflIII                                |       ||       |                 
     AhdI                                                                   
     AlwI 7             ||             |       ||  ||    |   |   |  2   |  2
   Alw26I                                                |                  
============================================================================
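
   The straight log10() placement can be sketched as below.  This is an 
illustration only: the lane width, the 100-1000 size range, the binning 
rule, and the "*" used for more than 9 fragments are assumptions, not 
details taken from tacg.

```python
import math


def gel_lane(fragments, width=70, lo=100, hi=1000):
    """Render one lane: '|' for one fragment in a column, a digit for 2-9."""
    counts = [0] * width
    span = math.log10(hi) - math.log10(lo)
    for size in fragments:
        if size < lo or size > hi:
            continue  # off the ends of this pseudo-gel
        frac = (math.log10(size) - math.log10(lo)) / span
        col = min(int(frac * width), width - 1)
        counts[col] += 1
    lane = ""
    for n in counts:
        if n == 0:
            lane += " "
        elif n == 1:
            lane += "|"
        elif n <= 9:
            lane += str(n)
        else:
            lane += "*"  # assumed marker for more than 9 fragments
    return lane


print("[" + gel_lane([100, 170, 459, 493, 956]) + "]")
```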

e) Other options:
   - extract subsequences from the input sequence (and make 
      circular/linear)
   - translations with linear restriction map in 1, 3, or 6 frames, 
      with 1 or 3 letter codes
   - Choose which of several codon preferences to use 
   - 'Write/don't write' most of the options
   - User-settable printing widths to ~200 characters 

5) The Odd one: 

   It was also designed to track its own use and spread - something in which 
I'm also interested.  To that end, the binaries have been compiled with 
code that spits a small amount of data back to me at each usage, telling me 
the IP number of the hosting machine, the UID of the person using it, the 
cpu type and OS it was run on, what flags were used in calling it, and the 
number of bases processed.  It does not return host or domain names, user 
names, or actual sequence.  The exact data that is returned is shown on 
stderr (usually the screen) each time.

Cheers
Harry
-- 
Harry J Mangalam, Microbiology and Molecular Genetics, UC Irvine,
      Irvine, CA, 92717, (714) 824-4824, fax (714) 824 8598
            http://hornet.mmg.uci.edu/~hjm/hjm.html
  Computational Biology..SGI..Woodworking..Bicycling..Linux..WWW 



