Instructions on Using GenBank E-mail servers
kristoff at GENBANK.BIO.NET
Tue Feb 20 19:28:48 EST 1990
The following can also be retrieved by sending a message containing
just the word HELP (no Subject: line) to search at genbank.bio.net.
Instructions for retrieving individual sequence entries are included
below, but can also be obtained by sending HELP to
retrieve at genbank.bio.net.
David Kristofferson, Ph.D.
GenBank On-line Service Manager
kristoff at genbank.bio.net
FASTA Server Help
GenBank now offers the FASTA program for nucleic acid sequence and
protein similarity searching of sequence databases. You can access
the GenBank FASTA Server through a number of different networks,
including Internet, BITNET, EARN, NETNORTH and JANET.
The FASTA program allows you to send a specially formatted mail
message containing the nucleic acid or protein query sequence to the
FASTA Server at GenBank. A FASTA sequence similarity search is then
performed against the specified database using the FASTA program
developed by William Pearson and David Lipman as described in their
Pearson, W.R. and Lipman, D.J. 1988. Improved Tools for
Biological Sequence Comparison. Proc. Natl. Acad. Sci.,
If you use FASTA as a research tool, we ask that this reference be
cited in your paper. The results of the FASTA search will be returned
to your local mail file as soon as they are processed and can be saved
in a separate disk file.
The following databases are currently available for FASTA searches:
     GenBank/all        Latest GenBank quarterly release PLUS
                        sequences added since last release.
     GenBank/new        GenBank sequences added since last release.
     GenBank/primate    GenBank subdivisions
     GenPept/all        Translated protein reading frames from
                        the latest GenBank release. Note that
                        GenPept contains translations only of
                        reading frames that are explicitly
                        mentioned in the GenBank sequence entry
     GenPept/new        Translated protein reading frames from
                        GenBank daily updates (translated from
     EMBL/all           Latest EMBL Data Library release PLUS
                        sequences added since last release.
     EMBL/new           EMBL sequences added since last release.
     SWISS-PROT/all     All of the SWISS-PROT protein database.
GenBank and EMBL are nucleic acid sequence databases and SWISS-PROT is
a protein sequence database. GenPept is produced by GenBank and
consists of translations of open reading frames as documented in the
sequence entry annotations ("pept" in features table).
Accessing the FASTA program
To access the program, send an electronic mail message containing the
formatted query sequence (as described below) to the following Internet
SEARCH at GENBANK.BIO.NET
If you are not on Internet, you may need to change the format of the
address. Consult your systems manager to determine the correct address.
If you would like to receive instructions on using the FASTA program,
send a mail message to the address above containing the word "Help" on
a single line of the mail message. Leave the Subject line in the mail
header blank. The help text will be updated when new information is
available for FASTA searches (such as new databases on-line). For
additional help on using FASTA, contact GenBank at (415) 962-7307 or
send an electronic mail message to the address:
CONSULTANT at GENBANK.BIO.NET
Formatting a Query
Queries consist of a mail message with search parameters identifying
the database to be searched, values related to the search and the
query sequence to be used in the search. The mail message has two
mandatory lines, three optional lines and a line identifying the query
sequence as described below. These lines are typed into the body of
the mail message in the order shown below:
     Parameter    Mandatory   Explanation
     DATALIB      Yes         This line specifies the database to be
                              searched (as described in the beginning of
                              this text) for the query sequence and must
                              be included in the message.
     KTUP         No          This line identifies the Ktup value which
                              specifies the sensitivity of the search.
                              Values range between 3 and 6 for nucleic acid
                              searches and between 1 and 2 for protein
                              searches. Lower values specify more sensitive
                              searches but require more time to complete.
                              For DNA sequences longer than 200 base pairs,
                              use a Ktup value of 4 or greater; lower values
                              are unnecessary and take longer to complete.
                              Protein searches will benefit from having a
                              Ktup value of 1 if you expect significant
                              matches with evolutionary amino acid
                              replacements but few exact amino acid matches.
                              The default value for nucleic acids is 4 and 1
                              for proteins.
     SCORES       No          This line specifies the number of best-ranked
                              sequences to be listed in the results. The
                              default value is 100.
     ALIGNMENTS   No          This line identifies the maximum number of
                              best-ranked sequences to be aligned in the
                              results. The default value is 20.
     BEGIN        Yes         This line must be included in the message. No
                              other information is typed on it.
The remainder of the message contains the query sequence in either
Pearson FASTA format or in IntelliGenetics format.
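The parameter lines above can be assembled by a short script. The
following Python sketch is only an illustration and is not part of the
FASTA Server. It ASSUMES that each parameter line is the keyword, one
space, and the value (for example "KTUP 4"); the text above gives the
keywords and their order but not the exact spelling of a line. The
function name build_query is hypothetical.

```python
NUCLEIC_KTUP = range(3, 7)   # 3..6, the documented nucleic acid range
PROTEIN_KTUP = range(1, 3)   # 1..2, the documented protein range


def build_query(datalib, sequence_text, ktup=None, scores=None,
                alignments=None, protein=False):
    """Return the body of a FASTA Server request as one string.

    ASSUMPTION: a parameter line is "KEYWORD value". Optional lines
    are left out when no value is given, so the server defaults apply.
    """
    lines = ["DATALIB " + datalib]
    if ktup is not None:
        allowed = PROTEIN_KTUP if protein else NUCLEIC_KTUP
        if ktup not in allowed:
            raise ValueError("ktup %d is outside the documented range" % ktup)
        lines.append("KTUP %d" % ktup)
    if scores is not None:
        lines.append("SCORES %d" % scores)
    if alignments is not None:
        lines.append("ALIGNMENTS %d" % alignments)
    lines.append("BEGIN")
    # The query sequence follows BEGIN directly, with no blank lines.
    lines.extend(sequence_text.strip().splitlines())
    if any(len(line) >= 80 for line in lines):
        raise ValueError("every line must be shorter than 80 characters")
    return "\n".join(lines) + "\n"
```

The resulting string is what would be pasted into the body of the mail
message; the mail header itself is still created by your mail program.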
Preparing Files for Similarity Searches
Only one sequence query is allowed per mail query. The query sequence
that you would like searched in the database must be contained in its
own file. Your sequence file must be in either Pearson format or
IntelliGenetics format. GenBank database file format is not currently
accepted; however, it is possible to use an editor to change the file
to Pearson format as described below. Note: all lines must be less
than 80 characters in length; longer lines will be truncated.
Pearson is the preferred format to use for query sequences. The format
includes a mandatory comment line beginning with a greater-than sign ">"
followed by the name of the sequence, a space, and an optional note
about the sequence. The sequence data begin on the next line without
the greater-than sign. For example:
>AGREP4 Monkey SV40-like genomic segment promoting transcription.
If your sequence was derived using one of the IntelliGenetics programs,
it can be used for a FASTA search. Comment lines are optional and
begin with a semi-colon ";". The name of the sequence and the
sequence data appear on separate lines without a semicolon. At the
end of the sequence data a number must follow to indicate if the
sequence is linear (1) or circular (2). For example:
;Monkey SV40-like genomic segment promoting transcription.
GenBank Flat-File Format
GenBank database file format is NOT accepted for query searches. The
files contain annotation data and residue numbers that cannot be
recognized by FASTA. For example:
1 ccccttcaaa tctattacaa ggtgagcgtc tcgccaaggc aatgaaatcg caatatgatg
61 taaccttgcg ctttggatta gacggactgt taaacggcaa
These files can be used only if they are changed to follow Pearson
format. The files must be stripped of annotation data and the numbers
in the sequence; the mandatory comment line (starting with ">") must
then be added.
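The conversion just described can also be done with a small script
instead of an editor. The Python sketch below is an illustration only.
It assumes the annotation lines have already been removed and that only
the numbered sequence lines remain; the function name and the sequence
name used in the usage note are hypothetical.

```python
import re


def genbank_lines_to_pearson(name, note, numbered_lines, width=60):
    """Turn numbered GenBank sequence lines into Pearson format text."""
    residues = []
    for line in numbered_lines:
        # Remove residue numbers and blanks; keep only the letters.
        residues.append(re.sub(r"[^A-Za-z]", "", line))
    sequence = "".join(residues)
    # Mandatory comment line: ">", name, a space, optional note.
    rows = [">%s %s" % (name, note)]
    # Break the sequence into rows well under the 80 character limit.
    for start in range(0, len(sequence), width):
        rows.append(sequence[start:start + width])
    return "\n".join(rows) + "\n"
```

Called with the two sequence lines shown above and a name such as
EXAMPLE, this yields a comment line followed by one row of 60 residues
and one row of 40 residues.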
Sending the Query Sequence
Use your local mail program to send GenBank your query sequence. Most
mail programs allow you to import a file into the mail message. You
can import your sequence file into the mail message on the line after
"Begin". Please follow the format in the following example of a FASTA
request PRECISELY, but note that the program is case-insensitive, i.e.
either upper or lower case letters may be used.
This is an example of a mail message sent for a FASTA search. Note that
the first four lines are a mail header that is automatically created
when you address a mail message. Nothing need be entered for the
Subject. Each line of information must be less than 80
characters in length. Longer lines will be truncated.
From: drbob at someaddress.somewhere.edu Tue Jun 14 21:36:38 1988
Date: 14 Jun 1988 2129:02-PDT
To: SEARCH at GENBANK.BIO.NET
The text that you enter into the body of the message begins with DATALIB
(do not add blank lines in the message):
>BOVPRL GenBank entry BOVPRL from gbmam file.907 nucleotides.
The sequence is then sent to the FASTA Server at GenBank. Once your
message is received, it is placed in a batch queue and processed in
the order it is received. Two queues called the fast and slow queues
process FASTA requests. The slow queue handles nucleic acid searches
of "genbank/all" and "embl/all." All other requests are placed in the
fast queue. Searches submitted to the fast queue require less CPU
time and are completed more quickly than those sent to the slow queue.
If you would like to know the status of the queues being processed,
you can send a mail message to the FASTA Server address
(SEARCH at GENBANK.BIO.NET) containing the word "QUEUE" on a single line
of the mail message (leave the Subject field blank).
The fast queue is labeled with the letter "d"; the slow queue is
labeled with "e".
You cannot have more than one search waiting in the slow queue at any
one time. If you send an additional search to the slow queue before
your first request has been processed, the initial search will be
cancelled. At MOST you can have one executing search and one waiting
job in the slow queue at the same time. Multiple jobs are currently
permitted in the fast queue.
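The queue rules above can be summarized in a few lines of Python. This
is a toy model written for illustration, not the server's code. It
reads the rule as "a new slow-queue request replaces your waiting
request, while a request that is already executing continues"; that
reading, and the names queue_for and SlowQueueSlot, are assumptions.

```python
SLOW_QUEUE_LIBRARIES = {"genbank/all", "embl/all"}


def queue_for(datalib):
    """Return "slow" or "fast" for a DATALIB value (case-insensitive)."""
    name = datalib.strip().lower()
    return "slow" if name in SLOW_QUEUE_LIBRARIES else "fast"


class SlowQueueSlot:
    """One user's slow-queue state: one running and one waiting job."""

    def __init__(self):
        self.running = None
        self.waiting = None

    def submit(self, job):
        """Add a job; return the waiting job it cancelled, or None."""
        if self.running is None:
            self.running = job
            return None
        cancelled = self.waiting
        self.waiting = job
        return cancelled

    def finish_running(self):
        """Mark the running job done and start the waiting one, if any."""
        self.running = self.waiting
        self.waiting = None
```

In this model a third request sent while one search is executing and
one is waiting cancels the waiting search, which matches the limit of
one executing and one waiting job described above.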
Handling the Results of a FASTA Search
When the results are returned, use your local mail program to retrieve
them. You can transfer the results of a FASTA search to a separate
disk file to free up space in your mail directory. Consult the
documentation for your local mail program for the commands to transfer
and read mail. If you wish to obtain sequences of interest, use the
e-mail retrieval server mentioned below or the IRX searching system
available through the GenBank On-line Service. Contact GenBank for
Interpreting the Results of a FASTA Search
The mail message returned after the FASTA search will contain the
sequence name and length, the database searched, and the scoring matrix
used. When searching all of GenBank, each subdivision of GenBank will
also be displayed.
To achieve a rapid yet sensitive search, the FASTA program uses a
hierarchy of steps to determine scores for the sequences searched in the
database. There are cut off points in each of the scoring steps so that
only high scoring sequences are used in subsequent searching steps.
Three scores are tallied and reported: INITN, INIT1, and OPT. Each of
these scores is assigned to a sequence based on its rank at a specific
point in the similarity searching process.
In comparing the query sequence to a sequence in the database, the
following steps are taken to determine the three scores:
1. First, the ktup value is used to establish a matrix for comparing
sequences. A value of 4 for a nucleic acid means that each group of
4 consecutive residues of the query sequence and the database
sequence will be compared. The sequences are compared on two
perpendicular axes, and a diagonal line is formed wherever matches of
ktup consecutive residues occur between the two sequences.
2. By joining match regions along the same diagonal that are not
separated by excessive mismatches, initial regions of high similarity
are identified. The 10 best diagonal regions of high similarity are
used for further analysis.
3. An INIT1 score is then assigned to each region of high similarity.
4. Next, FASTA attempts to join regions on the diagonal and assign
them an INITN score. The INITN score is determined by adding each of the
INIT1 scores of the two regions to join and subtracting a constant
value of 20 as a joining penalty. If the combined value of the
region is less than the INIT1 score of either region, the regions are
not joined. In this case, the INITN score will be equal to the
INIT1 score of each region. Only the sequences that have an INITN
score above a set cutoff point are kept for possible alignment.
5. Sequences with the highest INITN scores are then used for a
Needleman-Wunsch/Smith-Waterman alignment to determine their OPT score.
The OPT score is used to evaluate the alignments produced by FASTA.
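Steps 1 and 4 above can be made concrete with a short Python sketch.
It is a simplified illustration, not the FASTA source code: the first
function only counts exact ktup matches per diagonal, and the second
applies the joining rule to two regions. The text says regions are not
joined when the combined value is less than the INIT1 score "of either
region"; the sketch ASSUMES this means less than the larger of the two.
Both function names are hypothetical.

```python
from collections import Counter

JOIN_PENALTY = 20  # the constant joining penalty given in step 4


def ktup_diagonal_hits(query, subject, ktup):
    """Count exact ktup-length matches on each diagonal (i - j)."""
    positions = {}
    for j in range(len(subject) - ktup + 1):
        positions.setdefault(subject[j:j + ktup], []).append(j)
    hits = Counter()
    for i in range(len(query) - ktup + 1):
        for j in positions.get(query[i:i + ktup], []):
            hits[i - j] += 1
    return hits


def join_two_regions(init1_a, init1_b, penalty=JOIN_PENALTY):
    """Apply the step 4 rule; return (initn, joined)."""
    combined = init1_a + init1_b - penalty
    best_single = max(init1_a, init1_b)
    if combined < best_single:
        # Joining would lower the score, so the regions stay separate.
        return best_single, False
    return combined, True
```

For example, two regions with INIT1 scores of 84 and 58 would join to
give 84 + 58 - 20 = 122, while regions scoring 84 and 15 would not be
joined because 84 + 15 - 20 = 79 is less than 84.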
A histogram of the score distributions for both the INITN and INIT1
scores will be displayed in the results. The score value is given in
the left column and the number of sequences that were in that interval
is displayed in the two columns to the right. In the following example,
there were 377 sequences with INITN scores that were greater than 12 but
less than or equal to 16. In the graphic histogram, "+"'s and "-"'s
are used to distinguish the bars for INITN and INIT1 scores,
respectively, if the number of scores differ.
< 4 16 16:========
8 0 0:
12 1 1:=
16 377 377:==================================================
20 1272 1272:==================================================
24 2224 2224:==================================================
28 2717 2717:==================================================
32 3147 3147:==================================================
36 2921 2921:==================================================
40 2064 2064:==================================================
44 1243 1243:==================================================
48 568 568:==================================================
52 269 269:==================================================
56 105 105:==================================================
60 43 43:======================
64 21 22:===========
68 7 7:====
72 3 3:==
76 18 19:---------+
80 11 11:======
84 16 17:--------+
88 8 8:====
92 0 0:
96 1 1:=
100 0 0:
104 1 0:+
108 0 0:
112 0 0:
116 1 0:+
120 0 0:
124 1 0:+
128 0 0:
132 0 0:
136 1 1:=
140 0 0:
144 0 0:
148 0 0:
152 0 0:
156 0 0:
160 0 0:
>160 0 0:
KEY: + initn scores
- init1 scores
= no. of initn scores same as no. of init1 scores
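If you save the results to a file, the histogram rows can be read back
with a few lines of Python. The sketch below is an illustration that
assumes each row looks like the ones above: an interval label, the
INITN count, the INIT1 count, a colon, and the bar. The function name
is hypothetical.

```python
def parse_histogram_line(line):
    """Return (label, initn_count, init1_count) for one histogram row."""
    counts_part, _, _bar = line.partition(":")
    fields = counts_part.split()
    if len(fields) < 3:
        raise ValueError("not a histogram row: %r" % line)
    # The label may contain a blank, as in "< 4"; rejoin it as "<4".
    label = "".join(fields[:-2])
    return label, int(fields[-2]), int(fields[-1])
```

Rows where the two counts differ are the ones drawn with "+" and "-"
in the graphic histogram.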
The statistics of the search will be given after the histogram including
the total number of residues in the database searched, the number of
sequences searched, the average INITN and INIT1 scores with their
respective standard deviations, the number of scores that were above the
cutoff value, the value for ktup, and the value for fact.
searched 19156002 residues in 17047 sequences
mean initn score: 31.8 (s.d.= 8.44)
mean init1 score: 31.8 (s.d.= 8.44)
161 scores better than 55 saved, ktup: 4, fact: 4
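One simple way to use the reported mean and standard deviation is to
express a score as a number of standard deviations above the mean. The
Python sketch below is an illustration only; this figure is not
printed by the FASTA Server and is not a formal significance test. The
function names are hypothetical.

```python
import re

MEAN_LINE = re.compile(
    r"mean (initn|init1) score:\s*([\d.]+)\s*\(s\.d\.=\s*([\d.]+)\)"
)


def parse_mean_line(line):
    """Return (score_name, mean, sd) from a 'mean ... score' line."""
    match = MEAN_LINE.search(line)
    if match is None:
        raise ValueError("not a mean score line: %r" % line)
    return match.group(1), float(match.group(2)), float(match.group(3))


def deviations_above_mean(score, mean, sd):
    """How many standard deviations a score lies above the mean."""
    return (score - mean) / sd
```

With the statistics above, an INITN score of 134 lies
(134 - 31.8) / 8.44, or about 12.1, standard deviations above the mean.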
The name and scores for the top 100 best-ranking sequences, as
determined by their INITN score, will be presented in the results. In
addition, the optimized alignments for the top 20 ranking sequences
are given as shown below. (Please note that the default values are
100 and 20 but may be more or less depending on the parameters and
query sequence submitted.) Only the region that was considered
significant by the program will be displayed.
The best scores are: initn init1 opt
>SYNPUC81A - Plasmid PUC8-1, a modified pUC8 vector wi 134 134 140
>M13TG117 - Phage M13tg117 cloning vector in 5' end of 122 84 87
>SYNPUC92B - Plasmid PUC9-2, a modified pUC9 vector wi 114 74 80
>M13TG115 - Phage M13tg115 cloning vector in the 5' en 103 63 63
>MUSP53MR - Mouse p53 cellular tumor antigen mRNA, com 96 96 96
>SYNPUC81A - Plasmid PUC8-1, a modified pUC8 vector wi
initn= 134 init1= 134 opt= 140 80.0% identity in 65 nt overlap
10 20 30 40 50 60
X:::::::::::::::::: :::::::::: :::: v^:: :::: :: : : ::
10 20 30 40 50 60
Library scan: 0:05:20 total CPU time: 0:05:23
After all the alignments are printed out, the CPU time used for the
library (database) scan and the total CPU time will be displayed.
The following table shows the symbols used in the alignment and their
: an exact match
. an ambiguous match or a match with a conservatively replaced
- a gap in the sequence
X boundaries of the initial region that are associated with
the INIT1 score
^ and v boundaries shifted during the final optimization step which
Interpreting the Scores
The OPT score is derived from the alignment and is generally the best
score to evaluate the alignments produced by FASTA. Please note that
the program prints the scores in the order given by the INITN scores
and not the OPT scores. Sequences with high INITN scores usually
have high OPT scores, but this is not always true. Also, OPT scores
are determined for only some of the database sequences, so their
mean and standard deviation are not calculated. These statistics
are calculated only for INITN and INIT1 scores. For
more information on interpreting the scores produced by a FASTA
search, consult Pearson and Lipman's paper presented at the beginning
of the help text.
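Because the list of best scores is printed in INITN order, it can be
useful to re-sort a saved list by OPT score. The Python sketch below
is an illustration that assumes each row ends with the three scores in
the order initn, init1, opt, as in the example results above. The
function names are hypothetical.

```python
def parse_score_line(line):
    """Split a 'best scores' row into (description, initn, init1, opt)."""
    description, initn, init1, opt = line.rstrip().rsplit(None, 3)
    return description, int(initn), int(init1), int(opt)


def rank_by_opt(lines):
    """Return the parsed rows sorted by OPT score, highest first."""
    rows = [parse_score_line(line) for line in lines]
    return sorted(rows, key=lambda row: row[3], reverse=True)
```

Applied to the five example rows shown earlier, MUSP53MR (opt 96)
moves from fifth place to second place, behind SYNPUC81A (opt 140).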
Calculating Time Usage
The processing time for a FASTA search depends on the size of the
query sequence, the database selected, the ktup value, the number of
requests in the batch queue and the load on the GenBank computer.
Retrieving DataBank Entries found with FASTA
Database entries can be retrieved by either locus name or accession
number. To use the GenBank Retrieval System, send an electronic
message to RETRIEVE at GENBANK.BIO.NET containing as text (leave the
Subject: line blank) either an accession number, or an entry name, but
not both. The message text should contain exactly one word.
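A retrieval request is simple enough to build with a script. The
Python sketch below uses the standard email.message module to create
such a message; it does not send anything. The address is written in
user@host form on the ASSUMPTION that " at " in this text stands for
"@", and the function name and the sender address in the usage note
are hypothetical.

```python
from email.message import EmailMessage

RETRIEVE_ADDRESS = "retrieve@genbank.bio.net"  # assumed "@" form


def build_retrieve_message(sender, identifier):
    """Build a one-word request for the retrieval server."""
    word = identifier.strip()
    if len(word.split()) != 1:
        raise ValueError("the message text must contain exactly one word")
    message = EmailMessage()
    message["From"] = sender
    message["To"] = RETRIEVE_ADDRESS
    # No Subject header is added, so the Subject: line stays blank.
    message.set_content(word + "\n")
    return message
```

For example, build_retrieve_message("user@example.edu", "BOVPRL")
produces a message whose only text is the entry name BOVPRL.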
The data banks are searched in the order: GenBank New Data, GenBank
current release, EMBL New Data, EMBL current release, GenPept New
Data, GenPept current release, and Swiss-Prot until a match is found.
If an entry exists in both GenBank and EMBL with the same accession
number (the usual case), a query on the accession number will return
the GenBank version of the entry. If the EMBL-format version is
required, it can be retrieved from the file server at
NETSERV at EMBL.BITNET (for instructions send a message containing the
line HELP to that address).
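The search order explains why the GenBank version of an entry is
returned when the same accession number exists in both GenBank and
EMBL. The Python sketch below models that lookup; it is an
illustration, not the server's code. The data banks are represented
as plain dictionaries, and the function name is hypothetical.

```python
SEARCH_ORDER = [
    "GenBank New Data",
    "GenBank current release",
    "EMBL New Data",
    "EMBL current release",
    "GenPept New Data",
    "GenPept current release",
    "Swiss-Prot",
]


def find_entry(identifier, databanks):
    """Return (databank_name, entry) for the first match, or None.

    databanks maps a data bank name to a dict of identifier -> entry.
    """
    for name in SEARCH_ORDER:
        entries = databanks.get(name, {})
        if identifier in entries:
            # Stop at the first data bank that holds the identifier.
            return name, entries[identifier]
    return None
```

Because both GenBank data banks come before the EMBL ones in the list,
an identifier present in both is always answered from GenBank.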
An electronic version of the sequence data submission form used by the
sequence data banks is also available through the RETRIEVE server. To
receive a copy, send a message containing the word DATASUB as the only
line. Instructions for completing and submitting the form are
The FASTA program (and other related programs) can be purchased for
VAX/VMS, SUN/Unix, IBM-PC and Macintosh computers. To obtain the program
for one of these systems, contact Dr. William Pearson at:
Department of Biochemistry
Box 440 Jordan Hall
University of Virginia
Charlottesville, VA 22908
or send electronic mail to: wrp at Virginia.BITNET
You can also obtain the programs by anonymous FTP from
uvaarpa.virginia.edu and accessing the file, public_access/fasta.shar
End of FASTA Server Help