GCG manual

To: Nikolai troianovsk_s at MSDISK.WUSTL.EDU
Mon Apr 3 16:28:39 EST 1995


>I want to learn how to use GCG program.  If someone knows good manuals or 
>textbooks, please let me know.  
>dyryu at unity.ncsu.edu

Hi there.

Here it is.



HELP GUIDE FOR GCG PACKAGE

The   APPENDICES  contain  reference  tables,   forms,  and  detailed
explanations of several basic concepts of the GCG Package.

ASSEMBLE  makes  new  sequence  constructs from  pieces  of  existing
sequences.  It concatenates the fragments you specify and writes them out
as a new sequence file.  SEQED is more powerful than ASSEMBLE for [...]

BACKTRANSLATE backtranslates an amino acid sequence into a
nucleotide sequence.  The output display helps you recognize minimally
ambiguous regions that may be good for constructing synthetic probes.

The  BATCH QUEUE section of the USER'S GUIDE describes how to run
GCG programs and command procedures in the background, which lets
you continue to use your terminal.

BESTFIT  makes an optimal alignment of the best segment of similarity
between two  sequences.   Optimal  alignments are  found by inserting
gaps  to  maximize the  number of  matches using  the  local homology
algorithm of Smith and Waterman.
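
The local alignment idea behind BESTFIT can be sketched in a few lines of
Python.  This is only an illustration of the Smith and Waterman recurrence;
the function name, the scoring values, and the linear gap penalty are
assumptions made for the example and are not BESTFIT's actual parameters.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b."""
    prev = [0] * (len(b) + 1)          # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, so an alignment can start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Because scores are floored at zero, only the best matching segment
contributes, which is what "best segment of similarity" refers to.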

CHOPUP reads files with lines up to 32,000 characters long.  The file
is rewritten to a new file that has lines no longer than [...]

CIRCLES  uses an output file  from  FOLD to  make a circular Nussinov
plot of an RNA secondary structure.

CODONFREQUENCY tabulates codon usage from sequences and/or
existing codon usage tables.  The output file is correctly formatted as
input to the CODONPREFERENCE, CORRESPOND, and FRAMES programs.
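
Tabulating codon usage amounts to counting non-overlapping triplets in a
chosen reading frame.  A minimal Python sketch follows; the function name
and the frame argument are assumptions for illustration, and the output is
a plain counter, not a GCG codon frequency table.

```python
from collections import Counter

def codon_counts(seq, frame=0):
    """Count complete codons in seq, starting at offset frame (0, 1, or 2)."""
    seq = seq.upper()
    # Step three bases at a time; a trailing partial codon is ignored.
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return Counter(codons)
```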

CODONPREFERENCE is a frame-specific gene finder that tries to
recognize protein coding sequences by virtue of the similarity of their
codon usage to a codon  frequency table or by the bias of their composition
(usually GC)  in the third position of each codon.

COMPARE  compares two protein or nucleic acid sequences and creates
a file of the points of similarity between them for plotting with
DOTPLOT.  COMPARE finds the points using either a window/stringency or
a word match criterion.  The word comparison is 1,000 times faster than
the window/stringency comparison, but somewhat less sensitive.
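
The window/stringency criterion can be illustrated as follows: a point is
recorded wherever two windows of equal length share at least a given number
of identical positions.  This is a sketch under that assumption (simple
identity scoring, hypothetical function name and defaults); COMPARE itself
may score matches with a symbol comparison table.

```python
def window_matches(a, b, window=5, stringency=4):
    """Return (i, j) start positions where windows of a and b match."""
    points = []
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            # Count identical symbols at corresponding window positions.
            same = sum(x == y for x, y in zip(a[i:i + window], b[j:j + window]))
            if same >= stringency:
                points.append((i, j))
    return points
```

Every pair of windows is examined, which is why this kind of comparison is
much slower than a word match.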

COMPOSITION determines the composition of sequence(s).  For
nucleotide sequence(s), COMPOSITION also determines dinucleotide and
trinucleotide content.
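
Mono-, di-, and trinucleotide content can all be counted with the same
small routine.  The sketch below assumes overlapping k-mers are counted;
the text does not say whether COMPOSITION counts them this way, and the
function name is hypothetical.

```python
from collections import Counter

def composition(seq, k=1):
    """Count every overlapping substring of length k in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
```

For example, composition(seq, 2) gives dinucleotide counts and
composition(seq, 3) gives trinucleotide counts.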

COMPRESSTEXT removes any or all of the following from files:  1)
blank lines;  2) trailing space;  3) extra space between words;  or 4) all [...]

COMPTABLE  creates  a  symbol  comparison  table  using  equivalences
defined in a simplification scheme such as the one used for SIMPLIFY.  See
the SYMBOL COMPARISON TABLES section of the USER'S GUIDE for more [...]

CONSENSUS  calculates a consensus sequence  for a  set of pre-aligned
short nucleic acid sequences by tabulating the percent of G, A, T, and C
for each position in the  set.  FITCONSENSUS uses the CONSENSUS output
table  as  a probe to  search  for the  best  examples of  the derived
consensus in other nucleotide sequences.
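
The tabulation CONSENSUS performs can be sketched in Python as a table of
base percentages per alignment column.  The function names are
hypothetical, and picking the most frequent base as the consensus symbol is
an assumption for the example; the real program's output format is not
shown in this text.

```python
def base_percentages(aligned):
    """Return one dict per column giving the percent of G, A, T, and C."""
    table = []
    for column in zip(*aligned):       # walk the alignment column by column
        n = len(column)
        table.append({base: 100.0 * column.count(base) / n for base in "GATC"})
    return table

def consensus(aligned):
    """Pick the most frequent base in each column (ties go to G, A, T, C order)."""
    return "".join(max("GATC", key=row.get) for row in base_percentages(aligned))
```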

CORRESPOND  looks  for similar  patterns  of codon usage by comparing
codon frequency tables.

CORRUPT   randomly  introduces   small  numbers   of   substitutions,
insertions, and deletions into nucleotide sequence(s).

COUNT counts the number of characters, words, and lines in [...]
NOTE:  The documentation for this program was not ready when the
PROGRAM MANUAL went to press.  We plan to include complete
documentation with one of the incremental updates to Version 7.
Printed:  April 15, 1991  16:35 (1162)

CRYPT writes an encrypted version of a file using a key word that you
choose.  Run CRYPT a second time with the same keyword to restore the
encrypted output file to its original state.

DATASET creates a GCG data library from any set of sequences in [...]

DBINDEX generates the index files needed to access entries in a
data library.  The input file specification to DBINDEX is one or more data
library sequence (.Seq) files, such as Globin.Seq.

DETAB  replaces the tab characters in one  or more files with spaces.
The files can be written out in card-image format with records of fixed [...]

DISTANCES makes a table of the pair-wise distances within a group of
aligned sequences.

DIVERGE  measures  the  percent  divergence  of  two  protein  coding
sequences using the method of Perler and Efstratiadis.

DOMES uses an output file from FOLD to make a linear plot of a folded
RNA molecule.

DOTPLOT makes a dot-plot with the output file from COMPARE, FOLD, [...]

ECHO shows the decimal value and printing representation of each [...]
you press or type from the terminal.  Stop ECHO by using CTRL-Y.
NOTE:  The documentation for this program was not ready when the
PROGRAM MANUAL went to press.  We plan to include complete
documentation with one of the incremental updates to Version 7.
Printed:  April 15, 1991  16:49 (1162)

EXAMINE counts the number of characters in each line of a file.  Use
/ALL on the command line to show every character value on each line.
NOTE:  The documentation for this program was not ready when the
PROGRAM MANUAL went to press.  We plan to include complete
documentation with one of the incremental updates to Version 7.
Printed:  April 15, 1991  16:49 (1162)

EXTRACTPEPTIDE writes a peptide sequence from one or more of the
translation frames displayed in the output from MAP.  TRANSLATE
supersedes EXTRACTPEPTIDE for most applications.

FASTA does a Pearson and Lipman search for similarity between a
query sequence and any group of sequences.  FASTA answers the
question, "What sequences in the database are similar to my
sequence?"  The relationship between FASTA and WORDSEARCH has not
been characterized, but FASTA is faster and, for some searches, more [...]

FETCH copies GCG sequences or data files from the GCG database into
your directory or displays them on your terminal screen.

FIGURE makes figures and posters by drawing graphics and text
together.  You can include output from other GCG graphics programs as
part of a figure.

FILECHECK recognizes the identity of two files or checks the accuracy
of a file transfer by calculating a unique checksum based on all of the
characters in a file.
NOTE:  The documentation for this program was not ready when the
PROGRAM MANUAL went to press.  We plan to include complete
documentation with one of the incremental updates to Version 7.
Printed:  April 15, 1991  16:49 (1162)
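
The idea of comparing files by checksum can be illustrated with a short
Python sketch.  The text does not say which checksum FILECHECK computes, so
CRC-32 from the standard zlib module is used here purely as a stand-in, and
the function name is hypothetical.  Note that a CRC detects most transfer
errors but is not strictly unique to a file.

```python
import zlib

def file_checksum(path, chunk_size=65536):
    """Return the CRC-32 of every byte in the file at path."""
    crc = 0
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc
```

Two files with different checksums are certainly different; matching
checksums make it very likely, though not certain, that they are identical.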

FINDPATTERNS identifies sequences with short pattern queries like
GAATTC or YRYRYRYR.  You can define the patterns ambiguously and
allow mismatches.  You can provide the patterns in a file or simply type
them in from the terminal.
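
Ambiguous patterns such as YRYRYRYR can be searched for by expanding each
IUPAC nucleotide code into a character class.  The Python sketch below
shows that idea only; the function name is an assumption, mismatches are
not handled, and it is not a description of how FINDPATTERNS works
internally.

```python
import re

# Standard IUPAC nucleotide ambiguity codes expressed as regex classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def find_pattern(pattern, seq):
    """Return the start positions of every (possibly overlapping) match."""
    body = "".join(IUPAC[code] for code in pattern.upper())
    # The lookahead lets overlapping occurrences be reported as well.
    regex = re.compile("(?=(" + body + "))")
    return [m.start() for m in regex.finditer(seq.upper())]
```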

FINGERPRINT identifies the products of T1 ribonuclease digestion.

FITCONSENSUS uses a consensus table written by CONSENSUS as a
probe to find the best examples of the consensus in a DNA sequence.  You
can specify the number of fits you want to see, and FITCONSENSUS
tabulates them with their position, frame, and a statistical measure of
their quality.

FOLD finds an optimal secondary structure for an RNA molecule up to
1,200 bases long by the method of Zuker.

FONTS draws tables showing each character in the
software-generated fonts available to GCG graphics programs.
NOTE:  The documentation for this program was not ready when the
PROGRAM MANUAL went to press.  We plan to include complete
documentation with one of the incremental updates to Version 7.
Printed:  April 15, 1991  16:49 (1162)

FRAMES shows open reading frames for the six translation frames of a
DNA sequence.   FRAMES  can  superimpose  the pattern  of  rare codon
choices if you provide it with a codon frequency table.

FROMEMBL reformats sequences from the distribution (flat file)
format of the EMBL Data Library into individual sequence files in
GCG format.

FROMGENBANK reformats one or more sequences in the flat file format
of the GenBank data library into individual sequence files in
GCG format.

FROMIG reformats sequences from IntelliGenetics format into
individual files in GCG format.

FROMPIR reformats sequences from the protein database of the Protein
Identification Resource (PIR) into individual files in GCG format.

FROMSTADEN changes a sequence from Staden format into GCG format.
If the file contains a nucleotide sequence, the ambiguity codes [...]

GAP  uses the algorithm of Needleman and Wunsch to find the alignment
of two complete sequences that maximizes the number of matches and
minimizes the number of gaps.
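
For contrast with a local alignment, the global Needleman and Wunsch
recurrence can be sketched in Python as below.  The function name, the
scoring values, and the linear gap penalty are assumptions for
illustration and are not GAP's actual defaults.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best global alignment score between strings a and b."""
    # Aligning a prefix of b against nothing costs one gap per symbol.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]
```

Unlike the local case, scores may go negative and both sequences must be
aligned from end to end, so every unmatched symbol costs a gap.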

GAPSHOW displays an alignment by making a graph that shows the
distribution of similarities and gaps.  The two input sequences must
be aligned with either GAP or BESTFIT before they are given to GAPSHOW
for display.

GELASSEMBLE  is  a multiple  sequence editor  for  putting  sequences
together into assemblies called contigs.

GELDISASSEMBLE  breaks up the  contigs in a fragment assembly
project into single fragments.

GELENTER adds fragment sequences to a fragment assembly project.  It
accepts sequence data from your terminal keyboard, a digitizer, or
existing sequence files.

GELOVERLAP  compares the sequences in a fragment assembly project
and writes out a list of the points of overlap.  The output file is used by
GELASSEMBLE to load sequences into the editor at the positions where
you are likely to want to assemble them.

GELSTART begins a fragment assembly session by creating a new
fragment assembly project or by identifying an existing project.

GELVIEW  displays the structure of the existing contigs in a fragment
assembly project.  Run with  /CLUster  on  the  command line, GELVIEW
displays the contigs as they would appear after assembling the
overlaps found by GELOVERLAP.

GETSEQ  reads a sequence  from another computer  acting as a terminal
and creates the same sequence in GCG format on the VAX.

GETTEXT  reads a text file from another computer acting as a terminal
and creates a new text file on the VAX with the same contents.

HELICALWHEEL plots a peptide sequence as a helical wheel to help you
recognize amphiphilic regions.

ISOELECTRIC plots the charge as a function of pH for any peptide [...]

LINEUP  is a screen editor for editing  multiple sequence alignments.
You can edit up to 30 sequences simultaneously.  New sequences can be
typed  in by hand or added from existing sequence files.  A consensus
sequence identifies places where the sequences are in conflict.

LISTFILE prints a file on a printer attached to your terminal's
pass-through printer port.

LPRINT prints text file(s) on a PostScript printer connected [...]

MAP displays both strands of a DNA sequence with a restriction map
shown above the sequence and possible protein translations shown
below.

MAPPLOT displays restriction sites graphically.  If you don't have a
plotter, MAPPLOT can write a text file that approximates the graph.

MAPSORT finds the coordinates of the restriction enzyme cuts in a
sequence and sorts the fragments of the resulting digest by size.
MAPSORT can sort the fragments from single or multiple enzyme digests.
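
A single-enzyme digest of this kind can be sketched in Python.  The example
assumes a linear molecule, an exact recognition site, and a cut position
given as an offset into the site on the top strand; the function name and
arguments are hypothetical and do not reflect MAPSORT's enzyme data files.

```python
def digest(seq, site, cut_offset):
    """Cut seq at each occurrence of site; return (start, length) by size."""
    seq, site = seq.upper(), site.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)      # coordinate of the cut itself
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    fragments = [(start, end - start) for start, end in zip(bounds, bounds[1:])]
    # Largest fragment first, as on a gel read from the top.
    return sorted(fragments, key=lambda frag: frag[1], reverse=True)
```

For EcoRI, which cuts G^AATTC, the call would be
digest(seq, "GAATTC", 1).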

MOMENT makes a contour plot of the helical hydrophobic moment of a
peptide sequence.

MOTIFS looks for sequence motifs by searching through proteins for
the patterns defined in the PROSITE Dictionary of Protein Sites and
Patterns.  MOTIFS can display an abstract of the current literature on
each of the motifs it finds.

MOUNTAINS uses an output file from FOLD to make a plot of an RNA
secondary structure.

NAMES identifies GCG data files and sequence entries by name.  It can
show you what set of sequences is implied by any sequence [...]

The NEW USERS section gives an overview of how to use the VAX and
the GCG Package.  It explains the subset of VAX/VMS commands and
concepts that  are useful for biologists, and it introduces the basic
concepts used by the GCG Package.

ONECASE puts all of the alphabetic characters in a file into lower or
UPPER case.  It can also capitalize every word.

OVERLAP compares two sets of DNA sequences to each other in both
orientations using a WORDSEARCH style comparison.

PEPDATA translates DNA sequence(s)  in all six frames.
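
Six-frame translation can be sketched in Python with the standard genetic
code.  The names below are hypothetical, unknown or ambiguous codons are
written as X by assumption, and nothing here describes PEPDATA's actual
output format.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for all three codon positions.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(seq):
    """Translate complete codons of seq; unknown codons become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frames(seq):
    """Return translations of the three forward and three reverse frames."""
    seq = seq.upper()
    rev = seq.translate(COMPLEMENT)[::-1]      # reverse complement strand
    return [translate(strand[f:]) for strand in (seq, rev) for f in range(3)]
```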

PEPPLOT   plots  measures   of   protein   secondary  structure and
hydrophobicity in parallel panels of the same plot.

PEPTIDEMAP creates a peptide map of an amino acid sequence.

PEPTIDESORT  shows the  peptide  fragments from  a digest of an amino
acid sequence.  It sorts the peptides by weight, position, and
retention at pH 2.1, and shows the composition of each peptide.  It also
prints a summary of the composition of the whole protein.

PEPTIDESTRUCTURE  makes secondary structure predictions for a
peptide sequence.  The predictions include (in addition to alpha, beta, coil,
and turn) measures for antigenicity, flexibility, hydrophobicity, and
surface probability.  PLOTSTRUCTURE displays the predictions [...]

PILEUP  creates a multiple sequence alignment from a group of related
sequences using progressive, pairwise alignments.  It can also plot a tree
showing the clustering relationships used to create the alignment.

PLASMIDMAP draws a circular plot of a plasmid construct.  It can
display restriction patterns, inserts, and known genetic elements.  The
plot is suitable for publication, record keeping, or analysis.  It is drawn
from one or more labeling files such as those written by MAPSORT.

PLOTMETAFILE plots a GKS metafile on any GKS supported device.

PLOTSIMILARITY plots the running average of the similarity among the
sequences in a multiple sequence alignment.

PLOTSTRUCTURE plots the measures of protein secondary structure in
the output file from PEPTIDESTRUCTURE.  The measures can be shown in
parallel panels of a graph or with a two-dimensional squiggly [...]

PLOTTEST  plots a test  pattern  to see if your plotter is configured
properly.  The test pattern uses every GCG graphics feature.  It should
be similar to the one in the PROGRAM MANUAL.

PRETTY displays multiple sequence alignments and calculates a
consensus sequence.  It does not create the alignment; it simply
displays it.

PROFILEGAP makes an optimal alignment between a profile and a sequence.

PROFILEMAKE creates a position-specific scoring table, called a
profile, that quantitatively represents the information from a group of
aligned sequences.  The profile can then be used for database searching
(PROFILESEARCH) or sequence alignment (PROFILEGAP).

PROFILESCAN uses a database of profiles to find structural motifs in
protein sequences.

PROFILESEARCH  uses  a  profile  (representing  a  group  of  aligned
sequences) as a probe to search the database for new sequences with
similarity to the group.  The profile is created with the program
PROFILEMAKE.

PROFILESEGMENTS  makes optimal alignments  showing  the  segments
of similarity found by PROFILESEARCH.

PUBLISH arranges sequences for publication.  It creates a text file
that you can modify to your own needs with a text editor.

QUICKINDEX  builds hash  tables  from sequence(s)  in data libraries.
These tables make  up the database  that is  searched by QUICKSEARCH.
GCG provides hash tables for searching GenEMBL so you do not need
QUICKINDEX unless you have a large number of your own sequences that
you want to search with QUICKSEARCH.
NOTE:  The GCG Quick Searching System programs are only prototypes!
We are distributing them in the hope that you will make suggestions
about their future development.

QUICKSEARCH  rapidly  identifies the places  where  query sequence(s)
occur in a nucleotide sequence database.  The output is a file of
overlaps that can be displayed with the QUICKSHOW program.  You can
make up your own sequence database or use GenEMBL, which consists of
GenBank and those sequences in EMBL that are not represented in GenBank.
NOTE:  The GCG Quick Searching System programs are only prototypes!
We are distributing them in the hope that you will make suggestions
about their future development.

QUICKSHOW  displays  the overlaps found  by QUICKSEARCH  with  either
dot-plots  or  optimal  alignments.   The  dot-plots  can be reviewed
rapidly with a graphic screen.
NOTE:  The GCG Quick Searching System programs are only prototypes!
We are distributing them in the hope that you will make suggestions
about their future development.

RED is a text formatter that creates publication-quality documents on
a PostScript laser printer such as the Apple LaserWriter.  You can use 13
different fonts, scaling each font to any size.  You can also include
figures and graphics from any GCG graphics program within the text of
the  document.  All GCG  documentation,  correspondence, and publication
is done with RED.

REFORMAT rewrites sequence file(s), symbol comparison table(s), or
enzyme data file(s) so that they can be read by GCG programs.

REPEAT  finds direct repeats  in  sequences.  You must set  the size,
stringency, and range within which the repeat must occur; all
repeats of that size or greater are displayed as short alignments.

REPLACE makes character string replacements in text file(s).  You
provide a table of replacements in a file showing each existing string
and its replacement.

REVERSE reverses and/or complements a sequence.
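
Reversing and complementing are independent operations, which a short
Python sketch makes explicit.  The function name and keyword arguments are
assumptions for illustration; only A, C, G, and T are complemented here,
and any other symbols are passed through unchanged.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq, reverse=True, complement=True):
    """Reverse and/or complement a nucleotide sequence."""
    if complement:
        seq = seq.translate(COMPLEMENT)    # swap A<->T and C<->G
    if reverse:
        seq = seq[::-1]                    # read the strand backwards
    return seq
```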

SAMPLE extracts sequence fragments randomly from sequence(s).  You
can set a sampling rate to determine how many fragments SAMPLE [...]

SEGMENTS aligns and displays the segments of similarity found by
WORDSEARCH.

SEQED  is an interactive editor for  entering and modifying sequences
and  for  assembling  parts of existing  sequences  into  new genetic
constructs.  You can enter sequences from the keyboard or from [...]

SETKEYS writes a file in your directory that redefines your
keyboard's keys for sequence entry with the programs SEQED, LINEUP,
GELENTER, and GELASSEMBLE.  The output file, called Set.Keys, can be
edited if you want to use keys that were not defined in an
interactive session with SETKEYS.

SHIFT moves a file to the right or to the left as many columns as [...]

SHUFFLE  randomizes the order of  the symbols  in a sequence, keeping
the composition constant.
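
Shuffling the symbols of a sequence while preserving composition is a
plain permutation, as this Python sketch shows.  The function name and the
optional seed argument are assumptions added so that runs can be repeated.

```python
import random

def shuffle_sequence(seq, seed=None):
    """Return seq with its symbols in random order, composition unchanged."""
    symbols = list(seq)
    # A private Random instance keeps the global random state untouched.
    random.Random(seed).shuffle(symbols)
    return "".join(symbols)
```

Because the result is a permutation of the input, counts of each symbol
are identical before and after shuffling.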

SIMPLIFY simplifies peptide or nucleic acid sequences into broad [...]

The SPECIFYING SEQUENCES section tells how to name and [...]
sequences.  With Version 7.0 of the GCG Package there are over 65,000
nucleotide and protein sequences available to GCG programs.  Each
sequence in the data collections from GenBank, EMBL, SwissProt, and
PIR is identified in one of the volumes of the DATA REFERENCE [...]
Additionally, all of your personal sequences are available to [...]
This section of the  USER'S  GUIDE tells  you how  to  find all these
sequence data and how to use them with GCG programs, concentrating
especially on how to search for and name sequences.

SPEW  sends a GCG sequence from the VAX to a personal computer
acting as a terminal.

SQUIGGLES uses an output file from FOLD to make a plot of an RNA
secondary structure.

STATPLOT plots a set of parallel curves from a table of numbers such as
the table written by the WINDOW program.  The statistics in each
column of the table are associated with a position in the analyzed
sequence.

STEMLOOP finds stems (inverted repeats) within a sequence.  You
specify the minimum stem length, minimum and maximum loop sizes, and
the minimum number of bonds per stem.  All loops or only the best loops
can be displayed on your screen or written into a file.

STRINGSEARCH identifies sequences by searching sequence
documentation with character patterns such as 'globin' or 'human'.

TERMINATOR searches for prokaryotic factor-independent RNA
polymerase terminators according to the method of Brendel and Trifonov.

TESTCODE helps you identify protein coding sequences by plotting a
measure of the non-randomness of the composition at every third base.
The statistic does not require a codon frequency table.

TFASTA does a Pearson and Lipman search for similarity between a
query peptide sequence and any group of nucleotide sequences.  TFASTA
translates the nucleotide sequences in all six frames before
performing the comparison.  It is designed to answer the question, "What
implied peptide sequences in a nucleotide sequence database are similar
to my peptide sequence?"

TOIG converts GCG sequence file(s) into a single file in
IntelliGenetics format.

TOPIR writes GCG sequence(s)  into a single file in PIR format.

TOSTADEN writes a GCG sequence into a file in Staden format.  If the
file contains a nucleotide sequence, the ambiguity codes [...]

TRANSLATE translates nucleotide sequences into peptide sequences.

WINDOW  makes  a  table  of  the  frequencies of  different  sequence
patterns within a window as it is moved along a sequence.  A pattern is
any short sequence like GC or R or ATG.  You can plot the output with the
program STATPLOT.
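
The sliding-window tabulation can be sketched in Python as follows.  The
function name, the window and step defaults, and the output as a list of
(position, count) rows are assumptions for illustration; only literal
patterns such as GC or ATG are handled, not ambiguity codes such as R.

```python
def window_counts(seq, pattern, window=10, step=1):
    """Count occurrences of pattern in each window moved along seq."""
    seq, pattern = seq.upper(), pattern.upper()
    rows = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        # Overlapping occurrences inside the window are all counted.
        hits = sum(chunk.startswith(pattern, i)
                   for i in range(len(chunk) - len(pattern) + 1))
        rows.append((start, hits))
    return rows
```

Each row pairs a window start position with a count, which is the kind of
position-indexed table a plotting program could draw as a curve.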

WORDSEARCH identifies sequences similar to a query sequence using a
Wilbur and Lipman-style search.  WORDSEARCH answers the question,
"What sequences in the database are similar to my sequence?"  The
output is a list of significant diagonals whose alignments can be
displayed with SEGMENTS.

