Announcing: DNA WorkBench
tisdall at amalthea.humgen.upenn.edu
Fri Dec 3 15:26:49 EST 1993
A program for sequence searching and manipulation.
* It's free.
* It runs on Unix and Macs (and PCs - see below) with internet connections.
* Powerful and fast searches on Genbank and other databases.
* Client-server: access remote databases and programs.
* Parallel distributed processing for huge databases of the near future.
* Fully automatic control: by script, command-line or standard input.
* Enhanceable - run your own programs on databases and search results.
* Handles most sequence file formats silently - reformats easily.
* Many sequence manipulation functions.
* Convenient user interface, complete on-line help.
HOW TO GET IT
ftp cbil.humgen.upenn.edu (or, ftp 18.104.22.168)
log in as "anonymous" and give your email address as the password
mget ANNOUNCE INSTALL README (and, if you want, CHANGES and PERL-FAQ)
For UNIX: cd unix
For PC: cd pc
For MAC: cd mac
Please send mail to tisdall at cbil.humgen.upenn.edu if you install
the program. You will then be informed of program updates and other
The program is configured to access our network of Sun workstations
for database and program service. Since we are not funded to provide
this service to the world, we have a limit on the number of connections
allowed at any one time. In addition, should problems arise, the
service may be curtailed. You are encouraged to configure the program
on your systems to use your own databases and programs.
This is a so-called "alpha release" of the program. Although we have
been using it successfully for about a year as it has developed, it has
not been tested on a large number of variant Unixes or Macs. Installation
is expected to be straightforward; if you have problems, let us know,
and we will improve upon the installation instructions in the INSTALL file.
In particular, we have used the Unix version a lot - it's fairly stable.
The Mac port is rather recent, and there are no doubt some more bugs
to shake out. It also requires lots of memory.
The MSDOS PC port (which does not include sockets) is almost entirely
untested at this date.
--- It's free.
DNA WorkBench is copyrighted, but permission is granted for
non-commercial use, or for research purposes, under the usual conditions.
Details provided with the program. For commercial use, contact the author.
DNA WorkBench is written in the Perl language, which is also
free and is available for most computers. Perl is a C-like language
with many string manipulation features and supports object-oriented
programming. Perl is a great language for slinging DNA and for
biocomputing in general.
--- It runs on Unix, Macs, and MS-DOS PCs.
All that is required is that Perl be installed. For certain
UNIX computers, binaries are provided that do not require additional
Perl installation. Instructions on how to get Perl are included
in each distribution.
Perl has also been ported to VMS, Amigas, NT, OS/2, Atari ST,
and practically any other computer system you can think of. However,
I've only worked with UNIX, Macs, and MS-DOS PCs so far.
Internet connections are required for the database searching and
retrieval. This enables small computers to use the big servers for
database storage and program execution, while doing the easier jobs
and storing results locally. Alternatively, if you have the disk
space, you can configure the system to use your own local database.
At present, a good port to PCs that includes the internet "sockets"
connectivity is not available; but one is expected soon, now that good public
domain socket libraries have become available.
--- Powerful and fast searches on Genbank and other databases.
DNA WorkBench lets you get information out of Genbank and other
databases in a number of useful ways.
Several fields are indexed; for example, "organism sapiens" will
very rapidly return the many thousands of human DNA entries.
It is also possible to search for arbitrary text through an entire
database very rapidly. This is accomplished by running the search in
parallel on many machines, each machine doing a different part of the search.
The searches may be for sequence or for other text. Searches may
be specified in terms of a very complete regular expression language, which
gives powerful user-defined full-text search capabilities.
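The parallel full-text search described above can be sketched roughly
as follows. This is an illustration in Python, not the program's own
Perl code: threads stand in for the separate machines, and the names
search_partition and parallel_search, as well as the sample records,
are made up for the example.

```python
import re
from concurrent.futures import ThreadPoolExecutor

def search_partition(records, pattern):
    """Return the records in one partition that match the regular expression."""
    regex = re.compile(pattern)
    return [record for record in records if regex.search(record)]

def parallel_search(partitions, pattern):
    """Search every partition concurrently; merge the hits in partition order."""
    # Assumption: one worker per partition, standing in for one machine each.
    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        results = pool.map(search_partition, partitions,
                           [pattern] * len(partitions))
        return [hit for part in results for hit in part]

# Hypothetical database, already split into two parts.
partitions = [
    ["human ACGTGAATTCAA", "mouse TTTTCCCC"],
    ["yeast GGGAATTCGG", "human CCCCCCCC"],
]
hits = parallel_search(partitions, r"GAATTC")
```

Each worker sees only its own part of the database, so adding a part
adds a worker without changing the search itself.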
Results of searches may be combined in various ways, for instance
by taking the union, intersection, or difference of sets of search results.
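As a small illustration of combining results (in Python, not the
program's actual commands), two result sets can be treated as sets of
entry identifiers. The identifiers and the combine function below are
invented for the example.

```python
def combine(a, b, how):
    """Combine two sets of search results by union, intersection or difference."""
    operations = {
        "union": a | b,          # entries found by either search
        "intersection": a & b,   # entries found by both searches
        "difference": a - b,     # entries found by the first search only
    }
    return operations[how]

# Hypothetical result sets, identified by made-up entry names.
human = {"ENTRY1", "ENTRY2", "ENTRY3"}
insulin = {"ENTRY2", "ENTRY3", "ENTRY4"}

both = combine(human, insulin, "intersection")
human_only = combine(human, insulin, "difference")
```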
Results of searches may be narrowed down by additional criteria;
the user can define programs to run against search results; and search
results can be saved and reloaded at a later session.
--- Client-server: access remote databases and programs.
DNA WorkBench is both a genetic database server and
a program server. For instance, the popular BLAST, FASTA and PRIMER
programs can be run by DNA WorkBench on a remote machine, using your local
files and input, and the results returned to your local machine.
On startup, the program connects with a server that provides the
locations of the databases and other services. This reduces administration
requirements. The program then makes connections with the various servers
as required. If a server is unavailable, the program attempts to connect
with the alternate servers, if any are specified.
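The fallback to alternate servers might look something like the sketch
below. It is written in Python for illustration and is not the
program's own code; connect_with_fallback and the server names are
hypothetical, and the connect function is passed in so that the
example runs without a network.

```python
def connect_with_fallback(servers, connect):
    """Try each server in order; return (server, connection) for the first success."""
    errors = []
    for server in servers:
        try:
            return server, connect(server)
        except OSError as exc:
            # Remember the failure and move on to the next alternate.
            errors.append((server, exc))
    raise ConnectionError(f"no server available: {errors}")

def fake_connect(server):
    """Stand-in for a real connection attempt; the primary is pretended down."""
    if server == "primary":
        raise OSError("connection refused")
    return f"connection to {server}"

server, connection = connect_with_fallback(["primary", "backup"], fake_connect)
```

With real sockets, connect could be a function that calls
socket.create_connection on a (host, port) pair with a timeout.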
The servers impose a timeout, which helps reduce their load. The
timeout is transparent to the user: the client silently re-establishes
connections that have timed out. The servers also limit the number of
concurrent server processes they permit, ensuring that the server
system is not overloaded.
--- Parallel distributed processing for huge databases of the near future.
Large databases can be distributed over many machines. As the
amount of genetic information grows (it seems to be doubling about every
1.7 years) it will be possible to simply throw more and/or faster
machines into the network to maintain rapid search and retrieval.
This is very easy to configure - one simply adds the machine and file
name to the program, and installs the program on the new machine.
--- Fully automatic control: all functions can be program-controlled.
Large projects such as the Human Genome Project, which provided
the environment in which DNA WorkBench developed, as well as small
projects centered in an individual laboratory, can benefit from the
automation of biocomputing tasks. Every function of DNA WorkBench can
be controlled by specifying it in a script, or by including it on the
command line, or by reading it from the standard input as it is piped
in, or as the user types it. For instance, new sequence can be analyzed
in a variety of ways, and the results summarized and mailed to the
researcher, without the time-consuming chore of typing inputs and
responses to a variety of programs.
--- Enhanceable - run your own programs on databases or search results.
You can write a program that processes one database record,
and then specify the database libraries, or the previous search
results, to run the program against. The user can build up a collection
of programs, and easily specify to DNA WorkBench which program to run
and against what library or search result.
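The idea of a one-record program run against a whole library can be
sketched like this. The sketch is in Python rather than the Perl the
program uses, and it assumes GenBank-style flat files in which each
record ends with a line containing only "//". The function names and
the sample library are invented for the example.

```python
def split_records(text):
    """Split a flat file into records; a line containing only '//' ends a record."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "//":
            if current:
                records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if current:  # keep a final record that lacks a terminator
        records.append("\n".join(current))
    return records

def run_program(program, records):
    """Apply a user-written one-record program to every record."""
    return [program(record) for record in records]

def locus_name(record):
    """Example user program: report the name on the LOCUS line."""
    return record.split()[1]

# Hypothetical two-record library.
library = "LOCUS       REC1\nORIGIN\n acgt\n//\nLOCUS       REC2\nORIGIN\n//\n"
names = run_program(locus_name, split_records(library))
```

The same run_program call works on a list of previous search results,
since it only needs a list of records.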
The user can also enter arbitrary commands to the running program,
for instance, to examine the state of the program, or to perform some
--- Handles most sequence file formats silently - reformats easily.
Files containing sequence data in several common file formats
are read and parsed without any action necessary by the user.
Data can be reformatted by giving the name of the desired format.
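One way such silent format handling could work is sketched below in
Python. It is an assumption-laden illustration, not the program's
method: it recognizes only FASTA (first line starts with ">"),
GenBank-style records (first line starts with "LOCUS") and raw
sequence, and all function names are hypothetical.

```python
def detect_format(text):
    """Guess the format from the first non-blank line."""
    first = text.strip().splitlines()[0] if text.strip() else ""
    if first.startswith(">"):
        return "fasta"
    if first.startswith("LOCUS"):
        return "genbank"
    return "raw"

def read_sequence(text):
    """Return (name, sequence) whatever the input format."""
    fmt = detect_format(text)
    lines = text.strip().splitlines()
    if fmt == "fasta":
        name = (lines[0][1:].split() or ["unnamed"])[0]
        body = lines[1:]
    elif fmt == "genbank":
        parts = lines[0].split()
        name = parts[1] if len(parts) > 1 else "unnamed"
        # Sequence lines follow ORIGIN; position numbers are dropped below.
        start = next((i for i, l in enumerate(lines) if l.startswith("ORIGIN")),
                     len(lines))
        body = [l for l in lines[start + 1:] if l.strip() != "//"]
    else:
        name, body = "unnamed", lines
    sequence = "".join(ch for l in body for ch in l if ch.isalpha()).upper()
    return name, sequence

def write_sequence(name, sequence, fmt):
    """Reformat by giving the name of the desired format ('fasta' or 'raw')."""
    if fmt == "fasta":
        wrapped = "\n".join(sequence[i:i + 60] for i in range(0, len(sequence), 60))
        return f">{name}\n{wrapped}\n"
    if fmt == "raw":
        return sequence + "\n"
    raise ValueError(f"unknown format: {fmt}")

# Hypothetical GenBank-style input.
entry = "LOCUS       DEMO1   12 bp\nORIGIN\n        1 acgtacgtac gt\n//\n"
fasta = write_sequence(*read_sequence(entry), "fasta")
```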
--- Many sequence manipulation functions.
This list keeps growing. At present, it includes calculating
the reverse complement, displaying reading frames and
nucleotide-to-protein translations, editing, searching for restriction
enzyme sites, searching for human repeat or vector in a sequence,
comparing a sequence against a library or a user file, and searching
for a regular expression in a sequence.
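A few of these functions are simple enough to sketch in a handful of
lines. The Python below is an illustration only, not the program's
Perl code; the function names are made up, translation uses the
standard genetic code, and only the three forward reading frames are
shown.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement each base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]

def translate(seq, frame=0):
    """Translate DNA to protein from the given frame (0, 1 or 2)."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    # Codons with ambiguous bases are shown as X.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def reading_frames(seq):
    """Translations of the three forward reading frames."""
    return [translate(seq, frame) for frame in range(3)]

def find_sites(seq, site):
    """Zero-based start positions of a recognition site, e.g. GAATTC for EcoRI."""
    seq, site = seq.upper(), site.upper()
    return [i for i in range(len(seq) - len(site) + 1) if seq.startswith(site, i)]

frames = reading_frames("ATGGCCTAA")
ecori = find_sites("AAGAATTCTTGAATTC", "GAATTC")
```

The three reverse frames can be had by applying reading_frames to the
reverse complement.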
On my to-do list: contig assembly, searching by context-free and
other grammars, feature table manipulation, GDB access, approximate
regular expression matching, physical mapping of chromosomes, ...
--- Convenient user interface, complete on-line help.
Search results are easily manipulated, so you don't have to remember
and type long filenames. Search results are maintained in an array; the
size and current pointer are displayed in the prompt. One-line headers or
complete records are shown with the minimum possible number of keystrokes.
All searches are saved and can be quickly retrieved and combined, with
a minimum of disk and memory usage.
The online help facility contains all the documentation, plus
some short tutorials. It is based on examples, not formal descriptions.
Any command not recognized by the program is passed on to the
system. The user can perform most tasks without leaving the program,
which makes it behave (almost) like your shell with many new commands and
a stack of sequence.
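The pass-through of unrecognized commands can be pictured with the
following sketch. It is in Python, not the program's own Perl, and the
built-in commands "push" and "size", the make_shell function and the
stack are all invented for the example.

```python
import shlex
import subprocess

def make_shell(builtins, run_system=None):
    """Build a command executor that falls back to the system shell."""
    if run_system is None:
        def run_system(line):
            # Assumption: unknown commands go to the system shell verbatim.
            done = subprocess.run(line, shell=True, capture_output=True, text=True)
            return done.stdout

    def execute(line):
        words = shlex.split(line)
        if not words:
            return ""
        handler = builtins.get(words[0])
        if handler is not None:
            return handler(words[1:])   # a command the program knows
        return run_system(line)         # anything else goes to the system

    return execute

# Hypothetical built-in commands working on a stack of sequences.
stack = []

def push(args):
    stack.extend(args)
    return f"{len(stack)} on stack"

def size(args):
    return str(len(stack))

execute = make_shell({"push": push, "size": size})
first = execute("push ACGT")
count = execute("size")
```

A line such as "ls" or "date" is not in the table of built-ins, so it
would be handed to the system unchanged.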
Departments of Genetics and Computer and Information Science
Computational Biology and Informatics Laboratory, Human Genome Project
510 Blockley Hall
University of Pennsylvania
tisdall at cbil.humgen.upenn.edu