GenBank DNA extraction and parsing software
frist at ccu.umanitoba.ca
Sat Dec 14 17:49:54 EST 1991
In article <OWHITE.91Dec13104947 at zeste.nmsu.edu> owhite at nmsu.edu (smouldering dog) writes:
>have other investigators written programs to extract and examine DNA
>sequences from GenBank?
> owen white (owhite at nmsu.edu)
SUN Unix users may be interested in features, a menu interface to XYLEM's
getob program. The manual page for features is appended to the end of this
message. The XYLEM package is written in SUN Pascal and c-shell, and is
available by anonymous FTP:
FTP: /var/spool/ftp/pub/psgendb/xylem.tar.Z at ccu.umanitoba.ca
Previous XYLEM users should probably obtain the current version, which
includes some improvements in fetch and findkey, as well as an updated
version of splitdb which is now compatible with both old and new protocols
for PIR ACCESSION/#Accession lines.
Brian Fristensky | I'm carrying the weight of all the useless
Department of Plant Science | junk a modern man accumulates
University of Manitoba |
Winnipeg, MB R3T 2N2 CANADA | I'm a statistic in a system
frist at ccu.umanitoba.ca | that a civil servant dominates
Office phone: 204-474-6085 |
FAX: 204-275-5128 | Billy Joel - Running on Ice
----------------------------- features.doc ---------------------------
FEATURES.DOC update 16 Sep 91
FEATURES - extracts features from GenBank entries
FEATURES is an interactive user-interface for GETOB, which
greatly simplifies the process of retrieving features from
GenBank entries. Features can be retrieved either by specifying
keywords (eg. CDS, mRNA, exon, intron etc.) or by evaluating
expressions in the Features Table format.
An example of the FEATURES interactive menu is shown below:
FEATURES - Version 16 Sep 91
Parameter Description Value
1).................... FEATURES TO EXTRACT ....................> f
f:Type a feature at the keyboard
F:Read a list of features from a file
2)....................ENTRIES TO BE PROCESSED (choose one).....> n
Keyboard input - n:name a:accession # e:expression
File input - N:name(s) A:accession #(s) E:expression(s)
3)....................WHERE TO GET IT .........................> g
u:User-defined database subset g:complete GenBank database
4)....................WHERE TO SEND IT ........................> a
s:Each feature to a separate file a:All output to same file
Type number of your choice or 0 to continue:
Messages will be written to MPOCPCG.msg
Final sequence output will be written to MPOCPCG.out
Searching index file /home/psgendb/GenBank/gbacc.idx ...
Retrieving entries from /home/psgendb/GenBank/gborg ...
In the example, FEATURES was instructed to retrieve all tRNAs from
the GenBank entry MPOCPCG, which contains the liverwort plastid
genome. By default, the GenBank database was the source of the
sequence. Messages indicate the progress of the job. A log describing
the extraction of each feature is written to MPOCPCG.msg, while the
extracted features themselves are written to MPOCPCG.out. The first
step is to retrieve the MPOCPCG entry from GenBank, which is
accomplished by calling FETCH. Next, FEATURES extracts the specified
features from the entry.
An excerpt from MPOCPCG.msg is shown below, describing the extraction
of the seventh tRNA found in this entry. To create this tRNA, two exons
had to be joined. The qualifier line associated with this feature
indicates that it is an Isoleucine tRNA with a gat anticodon.
The actual sequence for this feature, as written to MPOCPCG.out, is
written with each exon beginning a new line:
1) FEATURES - choosing f will cause FEATURES to prompt for
a feature to extract. If you wish to extract several types of
features simultaneously (ie. F), you must construct a file listing the
feature keywords. The following example would retrieve both tRNA and
The words 'OBJECTS' and 'SITES' must enclose the feature keywords,
and each keyword must be on a separate line. For a rigorous
definition of the input file format, see the GETOB manual pages.
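As a rough illustration of the layout just described, the Python sketch below builds the text of such a file. It assumes that OBJECTS comes first and SITES last; that ordering, the function name feature_list_text, and the keywords in the usage line are assumptions made for illustration only. The GETOB manual pages remain the authority on the format.

```python
def feature_list_text(keywords):
    """Return feature keywords, one per line, enclosed by OBJECTS and SITES.

    Assumption: OBJECTS opens the list and SITES closes it.
    """
    # Drop blank entries and surrounding whitespace so each keyword
    # sits alone on its own line.
    cleaned = [k.strip() for k in keywords if k.strip()]
    return "\n".join(["OBJECTS", *cleaned, "SITES"]) + "\n"


if __name__ == "__main__":
    # Illustrative keywords only.
    print(feature_list_text(["tRNA", "CDS"]), end="")
```

The resulting text could be saved to a file and that filename given at the prompt for suboption F.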
In the menu shown above, f was chosen, and the user entered tRNA at
the prompt. Thus tRNA is now displayed on the Features: line. If
features had been specified from a file (suboption F) then the
filename containing the feature keywords would be displayed instead.
A complete list of legal feature keywords can be found in the GenBank
Release notes (gbrel.txt) under the subheading 'Feature Key Names'.
2) ENTRIES (ENTRIES TO BE PROCESSED) - choose one of the following:
n User is prompted for the name of an entry from which the
feature is to be extracted. The name of the entry will appear
on the 'Entries' line of the menu.
N User is prompted for a filename containing one or more
entry names. Each name must be on a separate line. The filename
will be displayed on the 'Entries' menu line.
a User is prompted for an accession number, which will appear
on the 'Entries' line of the menu.
e User is prompted for a GenBank Features expression of the
form @accession:location. Suboption E will cause the user to
be prompted for a filename containing one or more Feature
expressions. 'accession' refers to a GenBank
accession number, while 'location' is any legal feature location.
A brief description of location syntax can be found under the
subheading "Feature Location" in the GenBank release notes
(gbrel.txt). See "The DDBJ/EMBL/GenBank Feature Table:
Definition" Version 1.02 for a complete definition.
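To make the @accession:location form concrete, here is a minimal Python sketch that splits such an expression into its two parts. It is not how GETOB itself parses expressions; the function name parse_expression is hypothetical, and splitting at the first colon is an assumption, made so that any colons inside the location are preserved.

```python
def parse_expression(expr):
    """Split '@accession:location' into (accession, location)."""
    expr = expr.strip()
    if not expr.startswith("@"):
        raise ValueError("expression must begin with '@'")
    # Split at the first ':' only; the location part may itself
    # contain colons.
    accession, sep, location = expr[1:].partition(":")
    if not sep or not accession or not location:
        raise ValueError("expected the form @accession:location")
    return accession, location


if __name__ == "__main__":
    print(parse_expression('@M60287:/note="flaD (sin) homologue"'))
```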
The tRNA shown above could have been extracted by choosing
suboption e and entering either of the following expressions:
In the first example, the feature line from the original entry
is used as the location. In the second example, the feature is
found by its qualifier line, which also appeared in the
original entry. Note that the first 15 characters after the = in the
qualifier line must be unique among the qualifier lines of the same
entry.
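The 15-character rule can be checked before an expression is submitted. The Python sketch below flags qualifier lines that collide in their first 15 characters after the = sign. The function names are hypothetical, and counting the opening quotation mark among the 15 characters is an assumption; GETOB may count differently.

```python
from collections import Counter


def qualifier_key(qualifier, width=15):
    """Return the first `width` characters after the '=' of a qualifier."""
    # Assumption: an opening quotation mark counts as a character.
    _, _, value = qualifier.partition("=")
    return value[:width]


def ambiguous_qualifiers(qualifiers, width=15):
    """Return the qualifiers whose key is shared with another qualifier."""
    counts = Counter(qualifier_key(q, width) for q in qualifiers)
    return [q for q in qualifiers if counts[qualifier_key(q, width)] > 1]


if __name__ == "__main__":
    # Made-up qualifier lines, for illustration only.
    lines = ['/note="hypothetical protein A"',
             '/note="hypothetical protein B"',
             '/product="tRNA-Ile"']
    print(ambiguous_qualifiers(lines))
```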
The flaL protein coding region of B. licheniformis is described
in GenBank entry BLIFALA, accession number M60287 in the
/note="flaD (sin) homologue"
This feature could be retrieved using any of the following
@M60287:/note="flaD (sin) homologue"
Note that the /label= qualifier is special, in that labels are
specifically intended as unique tags on a feature. For labels,
only the label itself need be specified. Thus, /label=ORF2 is
equivalent to ORF2. For other qualifiers, the qualifier keyword
(eg. /note=) must be included.
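The equivalence of /label=ORF2 and ORF2 can be expressed as a small normalization step. This Python sketch is illustrative only; the function name normalize_location is hypothetical, and the sketch says nothing about how GETOB treats labels internally.

```python
def normalize_location(location):
    """Reduce '/label=NAME' to 'NAME'; leave other locations untouched."""
    prefix = "/label="
    if location.startswith(prefix):
        return location[len(prefix):]
    # Other qualifiers (eg. /note=) must keep their keyword.
    return location


if __name__ == "__main__":
    print(normalize_location("/label=ORF2"))
    print(normalize_location('/note="flaD (sin) homologue"'))
```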
3) DATABASE (WHERE TO GET IT) - By default, all entries processed will
be automatically retrieved from GenBank using FETCH. Specifying 'u'
(User-defined database subset) makes it possible to extract features
from GenBank subsets created by the user. Usually, retrieval of
features is much faster with a User-defined subset, so if you
frequently work with sets of genes, it is best to retrieve them
en-masse using FETCH, and work with them directly. For example, if
you had retrieved a set of Beta-globin sequences into a file called
'globin.gen', you could directly extract features from these entries
by specifying 'globin' or 'globin.gen' as your User-defined database.
If the file extension is '.gen', FEATURES will automatically create
temporary files called globin.ano, globin.wrp and globin.ind,
containing annotation, sequence, and an index, respectively. These
files will be read during feature extraction, and then discarded. If
you have already created such files using SPLITDB, simply specify
'globin', 'globin.ano', etc.; any form of the name will do, as long
as it does not have the .gen file extension.
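The naming convention described above can be summarized in a short Python sketch. It only computes the file names implied by a user-defined database name and whether they would be temporary. It does not call FEATURES or SPLITDB, and the function name database_files and the returned dictionary are hypothetical.

```python
import os

SPLIT_EXTENSIONS = (".ano", ".wrp", ".ind")  # annotation, sequence, index


def database_files(name):
    """Describe the split files implied by a user-defined database name."""
    base, ext = os.path.splitext(name)
    # Only a .gen name triggers creation of temporary split files.
    temporary = ext == ".gen"
    if ext != ".gen" and ext not in SPLIT_EXTENSIONS:
        # No recognised extension: the whole name is the base name.
        base = name
    return {
        "base": base,
        "temporary": temporary,
        "files": [base + e for e in SPLIT_EXTENSIONS],
    }


if __name__ == "__main__":
    print(database_files("globin.gen"))
    print(database_files("globin.ano"))
```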
One consequence of these conventions is that the individual GenBank
files can be processed directly. For example, suppose you were only
interested in rodent globins. You could directly access the rodent
division of GenBank by specifying the base name of that file division
(eg. /home/psgendb/GenBank/gbrod) as your user-defined database. In
this case, the files gbrod.ano, gbrod.wrp and gbrod.ind already
exist. Again, this approach is faster, since FEATURES would not have
to find and retrieve the sequences, but can read directly from the
database files. Finally, if you wanted to process all of the entries
in the database division, simply use the index file for that division
as your namefile (suboption N of option 2). The user is warned that a
GenBank division is a huge amount of data, and processing every entry
could take a long time.
4) WHERE TO SEND IT - By default (a), the output for all entries goes
to a single set of files, whose names are chosen by FEATURES,
depending on the setting of option 2, Entries. If a single name (n) or
accession number (a) has been chosen, that will be used as
the raw filename. For example, if you were processing the entry
WHTCAB, the output files would be WHTCAB.msg and WHTCAB.out. If names
(N), accession numbers (A) or expressions (E) were read from a file,
the raw name of that file would be used eg. cel