ICAtools - New DNA analysis tools

Jeremy Parsons jparsons at crc.ac.uk
Mon Oct 19 09:17:45 EST 1992


Dear Netters,

Some of my clustering and database searching programs  have  been
described  in  a  recent  publication (see below) but some others
have not been announced before. The programs cover a  wide  range
of applications and should be of interest to many people. I wrote
all the programs on a Sun but I tried to keep the code  portable;
all input and output is ASCII apart from index storage.

The copyright on all my code rests with the British MRC but I ex-
pect that they would be happy to see more academic users. I would
be particularly interested to hear from anyone willing to  inves-
tigate the programs' portability.

"Clustering cDNA sequences", 1992, Parsons, J.D., Brenner, S. and
Bishop, M.J., Comput. Applic. Biosci., Vol 8, pp 461-466

Jeremy Parsons


==========================================================================

The ICAtools
------------

A set of programs has been written to quantify  the  similarities
between  large  numbers of DNA sequences. Using this information,
similar sequences are clustered together at the rate of thousands
of  sequences  per day. The cluster structure information is kept
in a small, space-efficient index file which  ensures  that  disk
requirements  are  negligible.  Index  files  are  used to create
selective views and summaries of the entire sequence  dataset  of
interest.  These summaries can form a useful overview analysis of
the data produced by any large-scale DNA sequencing project.  The
ICAtools are useful for:

i)      finding novel sequence families;
ii)     database searching;
iii)    linker and vector screening;
iv)     determining the point of effective exhaustion of cDNA libraries;
v)      sequence overlap detection as a precursor to contig building.

Linker and vector screening can normally be performed using a da-
tabase  searching  program such as BLAST or a specialised program
like Roger Staden's Vep, but these methods rely  on  knowing  the
exact  sequence  of  such artifacts. This information may be con-
fused or unavailable because of an experimenter's  administrative
mistakes  and  protocol  errors or because of commercial secrecy.
The ICAtools do not need any  guiding  information  to  find  the
over-represented  sequence segments that characterise cloning ar-
tifacts. Thus, the programs have the ability to  find  "features"
that their users didn't know they were looking for.

When used for database searching, one of the tools, ICAass, is
more sensitive than BLAST, though slower, and is faster than
FASTA for batches of sequences.

Together, the ICAtools are a useful and flexible package for  both
the  data-mining  and the quality control of large DNA sequencing
projects. They have been used at the NCBI in the USA and  in  the
UK where they feature in the HGMPRC Computing Facility's menus.

ICAtool

ICAtool is a  jack-of-all-trades  cDNA  clustering  program.  The
basis  of  the program is a FASTA-like algorithm which is used to
compare pairs of sequences. A full dynamic-programming  algorithm
was  implemented  but then abandoned because it was unnecessarily
sensitive and slow. As an aid to performance, the results of pre-
viously  calculated  comparisons  are used to guide the choice of
which sequences are subsequently compared. This gives the program
a best-case computational complexity of order 'n', where n is the
number of sequences being clustered.
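
As a rough illustration of how earlier results can limit later
comparisons, the sketch below compares each new sequence against
only one representative per existing cluster. The function names,
the shared-word similarity test and its thresholds are assumptions
made for this example; it is not ICAtool's actual algorithm.

```python
def kmers(seq, k=8):
    """Return the set of all length-k words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similar(a, b, k=8, min_shared=3):
    """Hypothetical similarity test: at least min_shared shared k-mers."""
    return len(kmers(a, k) & kmers(b, k)) >= min_shared


def greedy_cluster(seqs, k=8, min_shared=3):
    """Put each sequence in the first cluster whose representative it matches."""
    clusters = []  # lists of indices; the first index is the representative
    for idx, seq in enumerate(seqs):
        for members in clusters:
            if similar(seqs[members[0]], seq, k, min_shared):
                members.append(idx)
                break
        else:
            # No representative matched, so this sequence starts a new cluster.
            clusters.append([idx])
    return clusters
```

If every sequence matches the first representative, only n-1
comparisons are made, which is the order-n best case; if nothing
matches, the work grows towards all-against-all.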

In addition to clustering similar sequences together, ICAtool can
perform  a rapid, focussed database search. In query mode, a pre-
prepared cluster index file is used to allow the  searching  pro-
cess  to spend a disproportionately large amount of its time com-
paring the query sequence  against  those  indexed  sequences  to
which  the query is most similar. ICAtool, by using file-pointers
rather than creating yet  another  sequence  format,  allows  the
simultaneous  use  of 5 different, existing formats and also uses
negligible disk space by avoiding unnecessary information  dupli-
cation.
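
The file-pointer idea can be sketched as an index of byte offsets
into an existing sequence file, so that no sequence data is
copied. The example assumes a FASTA-style file with '>' header
lines; the post does not name the five supported formats, and the
function names here are hypothetical.

```python
import io


def index_fasta(handle):
    """Map each record name to the byte offset of its '>' header line."""
    offsets = {}
    while True:
        pos = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            # The record name is taken as the first word after '>'.
            offsets[line[1:].split()[0].decode()] = pos
    return offsets


def fetch(handle, offset):
    """Read one record's sequence by seeking to its stored offset."""
    handle.seek(offset)
    handle.readline()  # skip the header line
    parts = []
    for line in handle:
        if line.startswith(b">"):
            break
        parts.append(line.strip())
    return b"".join(parts).decode()
```

The handle must be opened in binary mode so that the stored
offsets are true byte positions.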

n2tool

The program n2tool is similar to ICAtool because it performs  DNA
clustering  and shares the ICAtool cluster index file format. The
programs differ in three ways:
i)      n2tool cannot be used for querying;
ii)     n2tool's pairwise comparison algorithm is more BLAST-like than
        FASTA-like;
iii)    n2tool is guaranteed to compare all the submitted sequences
        against each other.

On datasets typical of our laboratory (thousands of ~300 bp
fragments), n2tool is quicker than ICAtool and produces more
concise clustering. n2tool is the only program used for clustering
genomic data because its clustering algorithm is less affected by
multi-domain repetitive sequences.
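
An exhaustive, all-against-all comparison of the kind n2tool
guarantees can be sketched as follows. The exact-shared-word hit
test stands in for a BLAST-like comparison, and the merging of
hits with a union-find structure is an assumption for
illustration; n2tool's real matching and clustering rules are not
described here.

```python
from itertools import combinations


def shares_word(a, b, w=12):
    """Hypothetical BLAST-like hit test: do a and b share an exact word of length w?"""
    words = {a[i:i + w] for i in range(len(a) - w + 1)}
    return any(b[j:j + w] in words for j in range(len(b) - w + 1))


def all_pairs_cluster(seqs, w=12):
    """Compare every pair of sequences and merge hits into clusters."""
    parent = list(range(len(seqs)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for i, j in combinations(range(len(seqs)), 2):
        if shares_word(seqs[i], seqs[j], w):
            parent[find(j)] = find(i)

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Note that this naive merging chains clusters together through any
shared word, so it would be more sensitive to repeats than the
text says n2tool is.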

Both n2tool and ICAtool can incrementally expand their indexes to
allow  extra  sequences to be added at any time; this is achieved
at minimal cost by not repeating previous calculations.  All  the
programs share the same concise index structure.

ICAass

There are some clustering applications for which ICAtool and
n2tool are inappropriate because they use local-similarity
comparison algorithms. When clustering, the program ICAass uses a
novel global-similarity algorithm which determines whether one
sequence is an approximate subsequence of another. ICAass has
been used to cluster a size-sorted EMBL DNA database and was able
to shrink the database files by up to 50% by removing all
approximate subsequences.
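
The idea of removing approximate subsequences from a size-sorted
set can be sketched with a standard approximate substring match
(edit distance with free start and end positions in the longer
sequence). This is a textbook technique swapped in for
illustration, not the novel ICAass algorithm, and the 5% error
threshold is an assumed value.

```python
def min_edit_distance_within(short, long_seq):
    """Smallest edit distance between short and any substring of long_seq."""
    prev = [0] * (len(long_seq) + 1)  # row 0: a match may start anywhere for free
    for i, ch in enumerate(short, start=1):
        cur = [i] + [0] * len(long_seq)
        for j, lc in enumerate(long_seq, start=1):
            cur[j] = min(prev[j - 1] + (ch != lc),  # match or substitution
                         prev[j] + 1,               # base of short left unmatched
                         cur[j - 1] + 1)            # extra base in long_seq
        prev = cur
    return min(prev)  # a match may end anywhere for free


def remove_approx_subsequences(seqs, max_error_rate=0.05):
    """Keep only sequences not approximately contained in a longer kept one."""
    kept = []
    for seq in sorted(seqs, key=len, reverse=True):  # longest first
        allowed = int(max_error_rate * len(seq))
        if not any(min_edit_distance_within(seq, k) <= allowed for k in kept):
            kept.append(seq)
    return kept
```

Processing the longest sequences first means each shorter sequence
only has to be checked against sequences already kept.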

In addition to shrinking databases, ICAass can be used to query
indexes, which it does very quickly using a local-similarity
algorithm and without the need for any specially formatted
databases.

ICAprint and ICAstats

ICAprint and ICAstats are a pair of programs that can be used to-
gether  to  display  how  sequences have been clustered together.
ICAprint has many options that  allow  selected  subsets  of  se-
quences  to  be  printed  out.  This allows, for example, an easy
selection of those sequences which didn't match any others or the
selection of single example sequences, one from each cluster.

ICAstats takes the output from ICAprint and produces an overview
of cluster sizes and some related statistics. ICAstats is
particularly useful to groups sequencing cDNA libraries because
the program uses a Poisson model to predict the number of
as-yet-unfound sequences left in the current library.
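
One way such a Poisson prediction could work is sketched below.
It assumes that every distinct sequence in the library is sampled
a Poisson-distributed number of times with the same mean, so that
observed cluster sizes follow a zero-truncated Poisson
distribution. Both that assumption and the function name are
illustrative; the post does not give the details of the ICAstats
model.

```python
import math


def estimate_unseen(cluster_sizes):
    """Estimate how many distinct sequences have not been sampled yet."""
    clusters = len(cluster_sizes)
    mean = sum(cluster_sizes) / clusters
    if mean <= 1.0:
        return math.inf  # every cluster is a singleton: no finite estimate
    # Solve mean = lam / (1 - exp(-lam)) for lam by bisection on (0, mean).
    lo, hi = 1e-9, mean
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid / -math.expm1(-mid) < mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    p_zero = math.exp(-lam)  # chance that a given sequence was never sampled
    return clusters * p_zero / (1 - p_zero)
```

For example, cluster sizes of 1, 1, 2 and 4 give a mean of 2 reads
per cluster and an estimate of roughly one unseen sequence. Real
cDNA libraries have very uneven abundances, so an equal-rate model
like this one would tend to underestimate what is left.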

ICAmatches

ICAprint can clearly show which sequences have been clustered
together but the task of explaining why is left to ICAmatches. The
explanation necessarily involves showing some form of alignment
but the traditional multiple alignment style would be too verbose
and, by only marking conserved bases, uninformative. In every
cluster there is one type example sequence chosen by the
clustering program. ICAmatches creates a novel style of multiple
alignment by printing, underneath a listing of the type sequence,
the cumulative frequencies of those other sequences in the cluster
that match that windowed region of the example sequence. This
allows a quick estimation of where and why all the sequences in
any cluster were put together. Of the ICAtools, this is the best
for identifying unknown vector or linker sequences.
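
A display of this general kind can be sketched by counting, for
each base of the type sequence, how many other cluster members
match there and printing the counts on a line beneath it. The
interval representation and the one-character-per-base layout are
assumptions; the actual ICAmatches output format is not shown in
this post.

```python
def coverage_counts(type_len, match_intervals):
    """Count, per base of the type sequence, how many cluster members match there.

    match_intervals holds hypothetical (start, end) pairs, 0-based and
    end-exclusive, one per matching region on the type sequence.
    """
    diff = [0] * (type_len + 1)
    for start, end in match_intervals:
        diff[start] += 1
        diff[end] -= 1
    counts, running = [], 0
    for delta in diff[:type_len]:
        running += delta
        counts.append(running)
    return counts


def render(type_seq, match_intervals):
    """Return the type sequence with a line of per-base match counts beneath it."""
    counts = coverage_counts(len(type_seq), match_intervals)
    # One character per base: '.' for none, digits up to 9, '+' for ten or more.
    marks = "".join("." if c == 0 else str(c) if c < 10 else "+" for c in counts)
    return type_seq + "\n" + marks
```

A region matched by an unusually large share of the cluster, such
as a common vector or linker segment, shows up as a run of high
counts.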



