GCG manual
To: Nikolai
troianovsk_s at MSDISK.WUSTL.EDU
Mon Apr 3 16:28:39 EST 1995
>Hi.
>
>I want to learn how to use GCG program. If someone knows good manuals or
>textbooks, please let me know.
>
>Thanks.
>
>dyryu at unity.ncsu.edu
Hi there.
Here it is.
Nikolai.
HELP GUIDE FOR GCG PACKAGE
The APPENDICES contain reference tables, forms, and detailed
explanations of several basic concepts of the GCG Package.
ASSEMBLE makes new sequence constructs from pieces of existing
sequences. It concatenates the fragments you specify and writes them out
as a new sequence file. SEQED is more powerful than ASSEMBLE for most
applications.
BACKTRANSLATE backtranslates an amino acid sequence into a
nucleotide sequence. The output display helps you recognize minimally
ambiguous regions that may be good for constructing synthetic probes.
The BATCH QUEUE section of the USER'S GUIDE describes how to run
GCG programs and command procedures in the background, which lets
you continue to use your terminal.
BESTFIT makes an optimal alignment of the best segment of similarity
between two sequences. Optimal alignments are found by inserting
gaps to maximize the number of matches using the local homology
algorithm of Smith and Waterman.
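
As a rough illustration of the Smith and Waterman idea, here is a minimal Python sketch that computes only the best local alignment score. It is not BESTFIT's code: the function name, the scoring values (match, mismatch, gap) and the simple linear gap penalty are all assumptions made for the example.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score between a and b.

    Scoring values are illustrative assumptions, not GCG defaults.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, so an alignment can start anywhere.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best
```

Because every cell is floored at zero, only the best-matching segment contributes to the score, which is what makes the alignment local rather than end to end.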
CHOPUP reads files with lines up to 32,000 characters long. The file
is rewritten to a new file that has lines no longer than 50
characters.
CIRCLES uses an output file from FOLD to make a circular Nussinov
plot of an RNA secondary structure.
CODONFREQUENCY tabulates codon usage from sequences and/or
existing codon usage tables. The output file is correctly formatted for
input to the CODONPREFERENCE, CORRESPOND, and FRAMES programs.
CODONPREFERENCE is a frame-specific gene finder that tries to
recognize protein coding sequences by virtue of the similarity of their
codon usage to a codon frequency table or by the bias of their composition
(usually GC) in the third position of each codon.
COMPARE compares two protein or nucleic acid sequences and creates
a file of the points of similarity between them for plotting with
DOTPLOT. COMPARE finds the points using either a window/stringency or
a word match criterion. The word comparison is 1,000 times faster than
the window/stringency comparison, but somewhat less sensitive.
COMPOSITION determines the composition of sequence(s). For
nucleotide sequence(s), COMPOSITION also determines dinucleotide and
trinucleotide content.
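
To make the idea of mono-, di-, and trinucleotide content concrete, here is a small Python sketch. It is not the COMPOSITION program itself; the function name is hypothetical, and counting overlapping k-mers is an assumption about how such content is usually tallied.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-letter word in seq (k=1, 2 or 3 here)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


counts = kmer_counts("GATTACA", 2)  # dinucleotide content
```

Calling it with k=1, k=2 and k=3 gives the base, dinucleotide and trinucleotide tallies respectively.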
COMPRESSTEXT removes any or all of the following from files: 1)
blank lines; 2) trailing space; 3) extra space between words; or 4) all
space.
COMPTABLE creates a symbol comparison table using equivalences
defined in a simplification scheme such as the one used for SIMPLIFY. (See
the SYMBOL COMPARISON TABLES section of the USER'S GUIDE for more
information.)
CONSENSUS calculates a consensus sequence for a set of pre-aligned
short nucleic acid sequences by tabulating the percent of G, A, T, and C
for each position in the set. FITCONSENSUS uses the CONSENSUS output
table as a probe to search for the best examples of the derived
consensus in other nucleotide sequences.
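
The tabulation described for CONSENSUS can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the input is a list of equal-length pre-aligned strings, and taking the most frequent base per column is a guess at how a consensus might be read off the table, not the program's documented rule.

```python
def percent_table(aligned, alphabet="GATC"):
    """Percent of each base at every position of pre-aligned sequences."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be pre-aligned to equal length")
    table = []
    for column in zip(*aligned):
        table.append({base: 100.0 * column.count(base) / len(column)
                      for base in alphabet})
    return table


def consensus(table):
    """Most frequent base per position (ties go to the earlier base in G, A, T, C)."""
    return "".join(max(row, key=row.get) for row in table)


sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
table = percent_table(sites)
```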
CORRESPOND looks for similar patterns of codon usage by comparing
codon frequency tables.
CORRUPT randomly introduces small numbers of substitutions,
insertions, and deletions into nucleotide sequence(s).
COUNT counts the number of characters, words, and lines in text
file(s).
NOTE: The documentation for this program was not ready when the
PROGRAM MANUAL went to press. We plan to include complete
documentation with one of the incremental updates to Version 7.
Printed: April 15, 1991 16:35 (1162)
CRYPT writes an encrypted version of a file using a key word that you
choose. Run CRYPT a second time with the same keyword to restore the
encrypted output file to its original state.
DATASET creates a GCG data library from any set of sequences in GCG
format.
DBINDEX generates the index files needed to access entries in a GCG
data library. The input file specification to DBINDEX is one or more data
library sequence (.Seq) files, such as Globin.Seq.
DETAB replaces the tab characters in one or more files with spaces.
The files can be written out in card-image format with records of fixed
length.
DISTANCES makes a table of the pair-wise distances within a group of
aligned sequences.
DIVERGE measures the percent divergence of two protein coding
sequences using the method of Perler and Efstratiadis.
DOMES uses an output file from FOLD to make a linear plot of a folded
RNA molecule.
DOTPLOT makes a dot-plot with the output file from COMPARE, FOLD,
or STEMLOOP.
ECHO shows the decimal value and printing representation of each key
you press or type from the terminal. Stop ECHO by using CTRL-Y.
NOTE: The documentation for this program was not ready when the
PROGRAM MANUAL went to press. We plan to include complete
documentation with one of the incremental updates to Version 7.
Printed: April 15, 1991 16:49 (1162)
EXAMINE counts the number of characters in each line of a file. Use
/ALL on the command line to show every character value on each line.
NOTE: The documentation for this program was not ready when the
PROGRAM MANUAL went to press. We plan to include complete
documentation with one of the incremental updates to Version 7.
Printed: April 15, 1991 16:49 (1162)
EXTRACTPEPTIDE writes a peptide sequence from one or more of the
translation frames displayed in the output from MAP. TRANSLATE
supersedes EXTRACTPEPTIDE for most applications.
FASTA does a Pearson and Lipman search for similarity between a
query sequence and any group of sequences. FASTA answers the
question, "What sequences in the database are similar to my
sequence?" The relationship between FASTA and WORDSEARCH has not
been characterized, but FASTA is faster and, for some searches, more
sensitive.
FETCH copies GCG sequences or data files from the GCG database into
your directory or displays them on your terminal screen.
FIGURE makes figures and posters by drawing graphics and text
together. You can include output from other GCG graphics programs as
part of a figure.
FILECHECK recognizes the identity of two files or checks the accuracy
of a file transfer by calculating a unique checksum based on all of the
characters in a file.
NOTE: The documentation for this program was not ready when the
PROGRAM MANUAL went to press. We plan to include complete
documentation with one of the incremental updates to Version 7.
Printed: April 15, 1991 16:49 (1162)
FINDPATTERNS identifies sequences with short pattern queries like
GAATTC or YRYRYRYR. You can define the patterns ambiguously and
allow mismatches. You can provide the patterns in a file or simply type
them in from the terminal.
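
One way to picture an ambiguous pattern such as YRYRYRYR is to expand each IUPAC nucleotide code into a character class and search with a regular expression. The Python sketch below does that. It is an assumption-laden illustration, not how FINDPATTERNS is implemented: the function name is hypothetical, only exact matches are found (no mismatches), and positions are reported 1-based.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
    "S": "[CG]", "W": "[AT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_pattern(pattern, seq):
    """Return 1-based start positions where pattern matches seq exactly."""
    regex = "".join(IUPAC[symbol] for symbol in pattern.upper())
    # The lookahead lets overlapping occurrences be reported too.
    return [m.start() + 1 for m in re.finditer("(?=" + regex + ")", seq.upper())]
```

Allowing mismatches would need a different approach, for example sliding the pattern along the sequence and counting disagreeing positions.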
FINGERPRINT identifies the products of T1 ribonuclease digestion.
FITCONSENSUS uses a consensus table written by CONSENSUS as a
probe to find the best examples of the consensus in a DNA sequence. You
can specify the number of fits you want to see, and FITCONSENSUS
tabulates them with their position, frame, and a statistical measure of
their quality.
FOLD finds an optimal secondary structure for an RNA molecule up to
1,200 bases long by the method of Zuker.
FONTS draws tables showing each character in the software-generated
fonts available to GCG graphics programs.
NOTE: The documentation for this program was not ready when the
PROGRAM MANUAL went to press. We plan to include complete
documentation with one of the incremental updates to Version 7.
Printed: April 15, 1991 16:49 (1162)
FRAMES shows open reading frames for the six translation frames of a
DNA sequence. FRAMES can superimpose the pattern of rare codon
choices if you provide it with a codon frequency table.
FROMEMBL reformats sequences from the distribution (flat file)
format of the EMBL Data Library into individual sequence files in GCG
format.
FROMGENBANK reformats one or more sequences in the flat file format
of the GenBank data library into individual sequence files in GCG
format.
FROMIG reformats sequences from IntelliGenetics format into
individual files in GCG format.
FROMPIR reformats sequences from the protein database of the Protein
Identification Resource (PIR) into individual files in GCG format.
FROMSTADEN changes a sequence from Staden format into GCG format.
If the file contains a nucleotide sequence, the ambiguity codes are
translated.
GAP uses the algorithm of Needleman and Wunsch to find the alignment
of two complete sequences that maximizes the number of matches and
minimizes the number of gaps.
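
For readers who want to see the shape of the Needleman and Wunsch approach, here is a minimal Python sketch that returns only the global alignment score. It is not GAP's implementation: the function name, the scoring values and the linear gap penalty are assumptions chosen for the example.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of two complete sequences.

    Scoring values are illustrative assumptions, not GCG defaults.
    """
    # Row 0: aligning a prefix of b against nothing costs one gap per symbol.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[-1]
```

Unlike a local alignment, the score here always covers both sequences from end to end, so leading and trailing gaps are penalized.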
GAPSHOW displays an alignment by making a graph that shows the
distribution of similarities and gaps. The two input sequences should
be aligned with either GAP or BESTFIT before they are given to GAPSHOW
for display.
GELASSEMBLE is a multiple sequence editor for putting sequences
together into assemblies called contigs.
GELDISASSEMBLE breaks up the contigs in a fragment assembly
project into single fragments.
GELENTER adds fragment sequences to a fragment assembly project. It
accepts sequence data from your terminal keyboard, a digitizer, or
existing sequence files.
GELOVERLAP compares the sequences in a fragment assembly project
and writes out a list of the points of overlap. The output file is used by
GELASSEMBLE to load sequences into the editor at the positions where
you are likely to want to assemble them.
GELSTART begins a fragment assembly session by creating a new
fragment assembly project or by identifying an existing project.
GELVIEW displays the structure of the existing contigs in a fragment
assembly project. Run with /CLUster on the command line, GELVIEW
displays the contigs as they would appear after assembling the new
overlaps found by GELOVERLAP.
GETSEQ reads a sequence from another computer acting as a terminal
and creates the same sequence in GCG format on the VAX.
GETTEXT reads a text file from another computer acting as a terminal
and creates a new text file on the VAX with the same contents and
format.
HELICALWHEEL plots a peptide sequence as a helical wheel to help you
recognize amphiphilic regions.
ISOELECTRIC plots the charge as a function of pH for any peptide
sequence.
LINEUP is a screen editor for editing multiple sequence alignments.
You can edit up to 30 sequences simultaneously. New sequences can be
typed in by hand or added from existing sequence files. A consensus
sequence identifies places where the sequences are in conflict.
LISTFILE prints a file on a printer attached to your terminal's
pass-through printer port.
LPRINT prints text file(s) on a PostScript printer connected to
LPrintPort.
MAP displays both strands of a DNA sequence with a restriction map
shown above the sequence and possible protein translations shown
below.
MAPPLOT displays restriction sites graphically. If you don't have a
plotter, MAPPLOT can write a text file that approximates the graph.
MAPSORT finds the coordinates of the restriction enzyme cuts in a DNA
sequence and sorts the fragments of the resulting digest by size.
MAPSORT can sort the fragments from single or multiple enzyme
digests.
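
The digest-and-sort idea can be sketched in Python as follows. This is a simplified illustration, not MAPSORT: it assumes a linear molecule, a single enzyme with an unambiguous recognition site, and a hypothetical `cut_offset` parameter giving how many bases of the site lie to the left of the cut (1 for EcoRI, G^AATTC).

```python
def digest(seq, site, cut_offset):
    """Return (start, end, size) fragments of a linear digest, largest first.

    Coordinates are 1-based and inclusive.
    """
    seq = seq.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    fragments = [(bounds[i] + 1, bounds[i + 1], bounds[i + 1] - bounds[i])
                 for i in range(len(bounds) - 1)]
    return sorted(fragments, key=lambda frag: frag[2], reverse=True)


fragments = digest("AAAGAATTCTTTTTGAATTCCC", "GAATTC", 1)
```

A multiple-enzyme digest could be approximated by pooling the cut positions from several sites before building the fragment list.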
MOMENT makes a contour plot of the helical hydrophobic moment of a
peptide sequence.
MOTIFS looks for sequence motifs by searching through proteins for
the patterns defined in the PROSITE Dictionary of Protein Sites and
Patterns. MOTIFS can display an abstract of the current literature on
each of the motifs it finds.
MOUNTAINS uses an output file from FOLD to make a plot of an RNA
secondary structure.
NAMES identifies GCG data files and sequence entries by name. It can
show you what set of sequences is implied by any sequence
specification.
The NEW USERS section gives an overview of how to use the VAX and
the GCG Package. It explains the subset of VAX/VMS commands and
concepts that are useful for biologists, and it introduces the basic
concepts used by the GCG Package.
ONECASE puts all of the alphabetic characters in a file into lower or
UPPER case. It can also capitalize every word.
OVERLAP compares two sets of DNA sequences to each other in both
orientations using a WORDSEARCH style comparison.
PEPDATA translates DNA sequence(s) in all six frames.
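
Translating in all six frames means reading the sequence from offsets 0, 1 and 2 on the given strand and again on its reverse complement. The Python sketch below illustrates this with the standard genetic code. It is not PEPDATA's code; the function names are hypothetical, and rendering unrecognized codons as X is an assumption.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate complete codons only; unknown codons become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Map (strand, frame) to the peptide read in that frame."""
    seq = seq.upper()
    rev = seq.translate(COMPLEMENT)[::-1]
    return {(strand, offset + 1): translate(s[offset:])
            for strand, s in (("+", seq), ("-", rev))
            for offset in range(3)}
```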
PEPPLOT plots measures of protein secondary structure and
hydrophobicity in parallel panels of the same plot.
PEPTIDEMAP creates a peptide map of an amino acid sequence.
PEPTIDESORT shows the peptide fragments from a digest of an amino
acid sequence. It sorts the peptides by weight, position, and HPLC
retention at pH 2.1, and shows the composition of each peptide. It also
prints a summary of the composition of the whole protein.
PEPTIDESTRUCTURE makes secondary structure predictions for a
peptide sequence. The predictions include (in addition to alpha, beta, coil,
and turn) measures for antigenicity, flexibility, hydrophobicity, and
surface probability. PLOTSTRUCTURE displays the predictions
graphically.
PILEUP creates a multiple sequence alignment from a group of related
sequences using progressive, pairwise alignments. It can also plot a tree
showing the clustering relationships used to create the alignment.
PLASMIDMAP draws a circular plot of a plasmid construct. It can
display restriction patterns, inserts, and known genetic elements. The
plot is suitable for publication, record keeping, or analysis. It is drawn
from one or more labeling files such as those written by MAPSORT.
PLOTMETAFILE plots a GKS metafile on any GKS supported device.
PLOTSIMILARITY plots the running average of the similarity among the
sequences in a multiple sequence alignment.
PLOTSTRUCTURE plots the measures of protein secondary structure in
the output file from PEPTIDESTRUCTURE. The measures can be shown on
parallel panels of a graph or with a two-dimensional squiggly
representation.
PLOTTEST plots a test pattern to see if your plotter is configured
properly. The test pattern uses every GCG graphics feature. It should
be similar to the one in the PROGRAM MANUAL.
PRETTY displays multiple sequence alignments and calculates a
consensus sequence. It does not create the alignment; it simply
displays it.
PROFILEGAP makes an optimal alignment between a profile and a
sequence.
PROFILEMAKE creates a position-specific scoring table, called a
profile, that quantitatively represents the information from a group of
aligned sequences. The profile can then be used for database searching
(PROFILESEARCH) or sequence alignment (PROFILEGAP).
PROFILESCAN uses a database of profiles to find structural motifs in
protein sequences.
PROFILESEARCH uses a profile (representing a group of aligned
sequences) as a probe to search the database for new sequences with
similarity to the group. The profile is created with the program
PROFILEMAKE.
PROFILESEGMENTS makes optimal alignments showing the segments
of similarity found by PROFILESEARCH.
PUBLISH arranges sequences for publication. It creates a text file
that you can modify to your own needs with a text editor.
QUICKINDEX builds hash tables from sequence(s) in data libraries.
These tables make up the database that is searched by QUICKSEARCH.
GCG provides hash tables for searching GenEMBL so you do not need
QUICKINDEX unless you have a large number of your own sequences that
you want to search with QUICKSEARCH.
NOTE: The GCG Quick Searching System programs are only prototypes!
We are distributing them in the hope that you will make suggestions
about their future development.
QUICKSEARCH rapidly identifies the places where query sequence(s)
occur in a nucleotide sequence database. The output is a file of
overlaps that can be displayed with the QUICKSHOW program. You can
make up your own sequence database or use GenEMBL, which consists of
GenBank and those sequences in EMBL that are not represented in
GenBank.
NOTE: The GCG Quick Searching System programs are only prototypes!
We are distributing them in the hope that you will make suggestions
about their future development.
QUICKSHOW displays the overlaps found by QUICKSEARCH with either
dot-plots or optimal alignments. The dot-plots can be reviewed
rapidly with a graphic screen.
NOTE: The GCG Quick Searching System programs are only prototypes!
We are distributing them in the hope that you will make suggestions
about their future development.
RED is a text formatter that creates publication-quality documents on
a PostScript laser printer such as the Apple LaserWriter. You can use 13
different fonts, scaling each font to any size. You can also include
figures and graphics from any GCG graphics program within the text of
the document. All GCG documentation, correspondence, and publication
is done with RED.
REFORMAT rewrites sequence file(s), symbol comparison table(s), or
enzyme data file(s) so that they can be read by GCG programs.
REPEAT finds direct repeats in sequences. You must set the size,
stringency, and range within which the repeat must occur; all the
repeats of that size or greater are displayed as short alignments.
REPLACE makes character string replacements in text file(s). You
provide a table of replacements in a file showing each existing string
and its replacement.
REVERSE reverses and/or complements a sequence.
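
The three operations implied by "reverses and/or complements" are easy to confuse, so here is a short Python sketch of each. It is an illustration only: the function names are hypothetical, and only the four unambiguous bases are handled (ambiguity codes pass through unchanged).

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse(seq):
    """Same strand, read backwards."""
    return seq[::-1]


def complement(seq):
    """Opposite strand, written in the same left-to-right order."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq):
    """Opposite strand, written 5' to 3'."""
    return complement(seq)[::-1]
```

The reverse complement is usually the one wanted when switching to the other strand of a DNA duplex.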
SAMPLE extracts sequence fragments randomly from sequence(s). You
can set a sampling rate to determine how many fragments SAMPLE
extracts.
SEGMENTS aligns and displays the segments of similarity found by
WORDSEARCH.
SEQED is an interactive editor for entering and modifying sequences
and for assembling parts of existing sequences into new genetic
constructs. You can enter sequences from the keyboard or from a
digitizer.
SETKEYS writes a file in your directory that redefines your
keyboard's keys for sequence entry with the programs SEQED, LINEUP,
GELENTER, and GELASSEMBLE. The output file, called Set.Keys, can be
edited if you want to use keys that were not defined in your
interactive session with SETKEYS.
SHIFT moves a file to the right or to the left as many columns as you
specify.
SHUFFLE randomizes the order of the symbols in a sequence, keeping
the composition constant.
SIMPLIFY simplifies peptide or nucleic acid sequences into broad
categories.
The SPECIFYING SEQUENCES section tells how to name and find
sequences. With Version 7.0 of the GCG Package there are over 65,000
nucleotide and protein sequences available to GCG programs. Each
sequence in the data collections from GenBank, EMBL, SwissProt and
PIR is identified in one of the volumes of the DATA REFERENCE SET.
Additionally, all of your personal sequences are available to GCG
programs.
This section of the USER'S GUIDE tells you how to find all these
sequence data and how to use them with GCG programs, concentrating
especially on how to search for and name sequences.
SPEW sends a GCG sequence from the VAX to a personal computer
acting as a terminal.
SQUIGGLES uses an output file from FOLD to make a plot of an RNA
secondary structure.
STATPLOT plots a set of parallel curves from a table of numbers like
the table written by the WINDOW program. The statistics in each
column of the table are associated with a position in the analyzed
sequence.
STEMLOOP finds stems (inverted repeats) within a sequence. You
specify the minimum stem length, minimum and maximum loop sizes, and
the minimum number of bonds per stem. All loops or only the best loops
can be displayed on your screen or written into a file.
STRINGSEARCH identifies sequences by searching sequence
documentation with character patterns such as 'globin' or 'human'.
TERMINATOR searches for prokaryotic factor-independent RNA
polymerase terminators according to the method of Brendel and Trifonov.
TESTCODE helps you identify protein coding sequences by plotting a
measure of the non-randomness of the composition at every third base.
The statistic does not require a codon frequency table.
TFASTA does a Pearson and Lipman search for similarity between a
query peptide sequence and any group of nucleotide sequences. TFASTA
translates the nucleotide sequences in all six frames before
performing the comparison. It is designed to answer the question, "What
implied peptide sequences in a nucleotide sequence database are similar
to my peptide sequence?"
TOIG converts GCG sequence file(s) into a single file in
IntelliGenetics format.
TOPIR writes GCG sequence(s) into a single file in PIR format.
TOSTADEN writes a GCG sequence into a file in Staden format. If the
file contains a nucleotide sequence, the ambiguity codes are
translated.
TRANSLATE translates nucleotide sequences into peptide sequences.
WINDOW makes a table of the frequencies of different sequence
patterns within a window as it is moved along a sequence. A pattern is
any short sequence like GC or R or ATG. You can plot the output with the
program STATPLOT.
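
A sliding-window tally of this kind can be sketched in Python as below. It is not the WINDOW program: the function name and the `step` parameter are hypothetical, the pattern is matched literally (so an ambiguity code such as R is not expanded), and overlapping occurrences are counted, which is an assumption.

```python
def window_counts(seq, pattern, window, step=1):
    """Return (1-based window start, occurrences of pattern) for each window."""
    seq, pattern = seq.upper(), pattern.upper()
    rows = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        # Test every position inside the window so overlapping hits count.
        hits = sum(chunk.startswith(pattern, i)
                   for i in range(window - len(pattern) + 1))
        rows.append((start + 1, hits))
    return rows


rows = window_counts("GCGCAT", "GC", 4)
```

Each row pairs a position with a statistic, which is the kind of table a plotting step can turn into a curve along the sequence.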
WORDSEARCH identifies sequences similar to a query sequence using a
Wilbur and Lipman-style search. WORDSEARCH answers the question,
"What sequences in the database are similar to my sequence?" The
output is a list of significant diagonals whose alignments can be
displayed with SEGMENTS.