In article <CDEz1A.7s3 at usenet.ucs.indiana.edu> wfischer at bio.indiana.edu (Will Fischer) writes:
>I need to extract given pieces of sequence from a set of EMBL/GenBank
>flat file entries (as retrieved from NCBI's email server), using ranges
>defined in the features table. For example, I'd like to be able to
>extract, from a set of entries, the DNA sequence for each exon, or
>again for every complete CDS feature (all exons assembled).
>
>Surely not everyone does this manually?
>
>What software exists that can actually parse the (eminently parsible)
>joint features table format? Please post reviews of programs you have
>used, or mail me directly and I will summarize.
>
>-- Will Fischer

If you're working in a Unix environment, the XYLEM package can do what
you want. At present, only Pascal source and SUN Sparc executable code
are available. I now have a version in C that is in the testing stage
and should be released in the next few weeks.

While the technical problem of parsing features is difficult, XYLEM
demonstrates that it can be done. The much more formidable problem
comes from errors and inconsistencies in the database itself. Many
words on this subject have been written by me and others in bionet.*
over the last few years. Recognizing that we're going to have to live
with these problems for years to come, XYLEM has capabilities that
facilitate human intervention where necessary.

X Y L E M

XYLEM is a package of tools designed to exploit the Unix
environment to enable the user to identify, extract and
manipulate data from major databases such as GenBank, EMBL and
PIR. Fundamental to the power of these programs is the ability to
perform operations on groups of sequences, represented by names
or accession numbers which function as virtual database subsets.
The most powerful program is FEATURES, which uses the GETOB parser
to evaluate GenBank/EMBL/DDBJ Features Table expressions, thereby
extracting features (e.g. mRNA, sig_peptide, intron) from lists of
entries. Additional programs perform operations such as translation
or randomization of datasets, and formatting of multiply-aligned sequences
for publication. XYLEM is compatible with the Fristensky Sequence
Analysis Package, and the Pearson FASTA programs, and can be used
from within the Genetic Data Environment (GDE) of Steven Smith.
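
To give a rough idea of what evaluating a Features Table expression
involves, here is a minimal Python sketch. It is not the FEATURES or
GETOB code (XYLEM itself is written in Pascal), and the function names
extract, revcomp and split_top_level are made up for illustration. It
assumes the sequence is already in memory as a string and handles only
simple positions, ranges, join() and complement(). References to other
entries, order(), and between-base (^) locations are not handled.

```python
import re

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(text[start:i])
            start = i + 1
    parts.append(text[start:])
    return parts


def extract(location, seq):
    """Evaluate a (simplified) feature location against seq.

    Positions are 1-based and inclusive, as in the feature table.
    Partial-end markers (< and >) are accepted and ignored.
    """
    loc = location.strip()
    if loc.startswith("complement(") and loc.endswith(")"):
        return revcomp(extract(loc[len("complement("):-1], seq))
    if loc.startswith("join(") and loc.endswith(")"):
        inner = loc[len("join("):-1]
        return "".join(extract(p, seq) for p in split_top_level(inner))
    m = re.fullmatch(r"[<>]?(\d+)(?:\.\.[<>]?(\d+))?", loc)
    if m is None:
        raise ValueError("unsupported location: " + location)
    start = int(m.group(1))
    end = int(m.group(2) or m.group(1))
    return seq[start - 1:end]


if __name__ == "__main__":
    dna = "aaccggttacgt"
    print(extract("join(1..4,9..12)", dna))   # aaccacgt
    print(extract("complement(1..4)", dna))   # ggtt
```

Anything the sketch does not recognize raises ValueError rather than
being guessed at, which leaves room for the kind of human intervention
mentioned above when an entry's feature table is inconsistent.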

FTP: psgendb/xylem.tar.Z at ftp.cc.umanitoba.ca

Brian Fristensky |
Department of Plant Science | A question is like a knife that slices
University of Manitoba | through the stage backdrop and gives us
Winnipeg, MB R3T 2N2 CANADA | a look at what lies hidden behind.
frist at cc.umanitoba.ca |
Office phone: 204-474-6085 | Milan Kundera, THE UNBEARABLE LIGHTNESS
FAX: 204-261-5732 | OF BEING