7 Database Search Programs: a Comparison

Duncan Rouch ROUCHDA at VMS1.ACADEMIC-COMPUTING-SERVICE.BIRMINGHAM.AC.UK
Fri Feb 11 08:58:55 EST 1994


Hi Netters,
	I have received many requests for our paper
"A comparison of seven protein database search programs",
so I thought it better to post it here on Bio-soft,
as it is a short one.  Thanks to BINARY and Bio-line for allowing
the paper to be posted before publication.  Please don't
send any more reprint requests.

Why seven?  Well, those were the ones we could hook
up to the same database, so e-mail servers couldn't be
included.

In presenting our paper comparing database search programs
I invite people to carry out performance comparisons
in other areas of bioinformatics, as this kind of work is
a bit thin on the ground.  I know most academic programmers
tend not to be too keen on comparison work and I understand 
the reasons for this (we plan to discuss this as part
of a future document).  However, 

	(1) biologists aren't doing their best research if they 
	aren't using the most efficient and effective applications 
	from bioinformatics.  So it is important to do and publish
	these tests.  

	(2) Comparison should also promote improvement in 
	applications, as well as support the core pure science 
	research in bioinformatics.

By the way, our next discussion document should be out soon,
about improving the impact of bioinformatics in biology.


Duncan Rouch
School of Biological Sciences, University of Birmingham, UK


------------------------------------------------------------
A comparison of seven protein database search programs*
-------------------------------------------------------
BINARY (1994) 6:17-18.

*This version is as for BINARY, but you will have to see the journal 
to see the figure; however, I've put a table in to stand in for it.
See the Appendix for information on obtaining BINARY or the
Bioline version.



Duncan A. Rouch, Nigel L. Brown and Alan J. Bleasby1

School of Biological Sciences, The University of Birmingham, 
Birmingham B15 2TT, U.K.
Electronic mail: D.A.Rouch at uk.ac.bham, N.L.Brown at uk.ac.bham

1 SEQNET, SERC Daresbury Laboratory, Daresbury, Warrington WA4 
4AD, U.K.
Electronic mail: A.Bleasby at uk.ac.daresbury




Address for correspondence:

Dr D.A. Rouch 
School of Biological Sciences
University of Birmingham
Edgbaston
Birmingham B15 2TT
UK

Telephone: (021) 414 6551
FAX: (021) 414 6557

When a contiguous gene reading frame of unknown function 
is identified in a nucleotide sequence, the next step is 
usually to search for proteins homologous to the translation 
product.  We have attempted to determine which of a range of 
programs are best suited to such an initial general 
comparison of a protein sequence against a database.  Seven 
programs were used: Wordsearch (1), FASTA (2), GBLASTA (3), 
BLASTP (3), BLAST3 (4), SWEEP (5) and PROWL (J.K. Crook and 
J.F. Collins, unpublished).  All programs were configured to 
search the PIR23 protein database (all sections, 6), and 
were executed with default parameters as far as possible in 
order to most closely approximate the way these programs are 
used in practice by most molecular biologists.  The query 
sequence used was human b-globin (PIR23, entry HBHU).  The 
sequence was used both complete and as contiguous 
derivatives of a third of the total length.  A globin was 
chosen as the probe due to both the recurrence of globins in 
the database and the range in the degree of pairwise 
similarity amongst these.  Furthermore, the identities of 
homologous sequences within PIR23 can be established 
independently, by scanning the list of names; there are 499 
globin family sequences in PIR23 from a total of 14,372 
sequences.

In order to compare the results from different algorithms 
a  new, program-independent, evaluation method was required 
since the scoring systems of most of the programs are 
unique.  The ability of the programs to detect homologous 
globin sequences was measured as follows.  Result lists from 
database searches, ordered by score, were scanned downwards 
with a window of 10 sequences until the number of globins in 
the window fell to 5. The number of globins in the result 
list  above and including the last globin in the window was 
then determined.  Finally, the globin count was converted to 
a percentage of the total number of globins in the database.  
This method might have given biased results if different 
programs embedded non-homologous (non-globin) sequences in 
different ways, in regions where there was a high density of 
homologous sequences.  However, empirical tests indicated 
this effect to be negligible.  The method thus allowed an 
objective evaluation of how well each program can detect 
homologous sequences.
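
The evaluation procedure described above can be sketched in a few 
lines of Python.  This is only an illustration, not the code used 
for the paper: the function name and arguments are hypothetical, 
the window is assumed to slide one position at a time, "fell to 5" 
is read as "5 or fewer", and if the window never falls that low the 
whole result list is counted.

```python
def globin_recovery(hits, total_homologs, window=10, threshold=5):
    """Percentage of database homologues recovered from a ranked hit list.

    hits           -- list of bool, ordered by descending score; True marks
                      a hit that is a known homologue (here, a globin)
    total_homologs -- number of homologues in the whole database
    window         -- size of the sliding window (10 in the text)
    threshold      -- stop when the window holds this many homologues or fewer
    """
    stop_end = len(hits)  # assumption: count the whole list if we never stop
    for start in range(0, max(len(hits) - window, 0) + 1):
        win = hits[start:start + window]
        if sum(win) <= threshold:
            stop_end = start + len(win)
            break
    # Every homologue ranked before the end of the stopping window lies at or
    # above the last homologue in that window, so this is the count of
    # homologues "above and including the last globin in the window".
    found = sum(hits[:stop_end])
    return 100.0 * found / total_homologs


# Toy ranked list: 20 homologues, then alternating hits, then 10 non-homologues.
example = [True] * 20 + [True, False] * 5 + [False] * 10
print(globin_recovery(example, total_homologs=50))  # prints 50.0
```

For a real search, hits would be built by marking each entry of the 
score-ordered result list as globin or non-globin from its name, and 
total_homologs would be 499 for PIR23.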

Using complete human b-globin as the query sequence, the 
programs showed a range of efficiency in detecting other 
globins (Table 1).  The top three programs by this method 
showed similar globin recoveries: PROWL (90.8%), 
SWEEP (90.0%) and BLAST3 (90.6%).  The other programs gave 
scores between 73.5% and 87.8%.  When the shortened b-globin 
sequences were used as probes there was a drop of 
approximately 20% in globin detection for most programs.  
This is consistent with the length dependence of the scoring 
techniques used by the programs.  The top three programs 
were the same as with the first test (PROWL 72.5%, SWEEP 
70.3%, BLAST3 72.5%).  Of these three programs, BLAST3 has 
the limitation that there must be at least two homologous 
sequences in the database for homology to be found, as it 
depends on 3-way alignments.  The drop in globin detection  
for shorter sequences was most pronounced for Wordsearch, a 
program from the UWGCG package (1).  In summary, this 
method suggests that, of the programs tested, PROWL 
(Prosrch), SWEEP and BLAST3 are the best choices for 
general protein database searching.
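
The size of the drop for each program can be checked directly from 
the Table 1 figures.  The short Python sketch below does the 
arithmetic; it assumes the "drop" is meant as a difference in 
percentage points between the two columns, and the variable names 
are only illustrative.

```python
from statistics import median

# Detection percentages copied from Table 1:
# (complete b-globin query, shortened b-globin queries)
table1 = {
    "PROWL": (90.8, 72.5),
    "SWEEP": (90.0, 70.3),
    "BLAST3": (90.6, 72.5),
    "BLASTP": (87.8, 63.5),
    "GBLASTA": (86.6, 65.3),
    "FASTA": (76.0, 60.1),
    "WORDSEARCH": (73.6, 30.9),
}

# Drop in detection, in percentage points, rounded to one decimal place.
drops = {name: round(full - short, 1) for name, (full, short) in table1.items()}
median_drop = median(drops.values())

for name, drop in drops.items():
    print(f"{name:<11}{drop:5.1f}")
print(f"{'median':<11}{median_drop:5.1f}")
```

On these values the differences run from 15.9 points (FASTA) to 42.7 
points (Wordsearch), with a median of 19.7, which matches the 
"approximately 20%" quoted for most programs.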



Table 1.  Performance of database search programs in globin detection. 
_____________________________________________________________
Program		Detection of	Detection of 
		b-globin (%)	shortened b-globin (%)
-------------------------------------------------------------
PROWL		90.8		72.5
SWEEP		90.0		70.3
BLAST3		90.6		72.5
BLASTP		87.8		63.5
GBLASTA		86.6		65.3
FASTA		76.0		60.1
WORDSEARCH	73.6		30.9
_____________________________________________________________

Table 1.  Performance of database search programs in 
globin detection.  Programs evaluated were PROWL 0.1 
(PR), SWEEP 1.0 (SW), BLAST3 * (BL3), BLASTP * 
(BLP), GBLASTA * (GBL), FASTA 1.0 (FA) and 
Wordsearch 7.0 (WO); *, version as at 9/1992.  
Although not yet distributed, PROWL is equivalent to 
Prosrch (7), which is accessible on the SEQNET node at 
Edinburgh, U.K.  Percentage detection of globin 
family sequences in PIR23 is shown for query 
sequences, human b-globin (light hatching) and 
shortened b-globin derivatives (heavy hatching): in 
the latter case each third of the globin sequence was 
queried independently, and  the three results 
averaged.  All programs were run to give pairwise 
alignments with default parameters, except to give 
extended result lists (BLAST-type programs were 
executed  with S=35, R=1.0, and L=105, where 
applicable).
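
The handling of the shortened queries described in the legend (each 
third searched independently, the three detection percentages then 
averaged) might look like the following Python sketch.  The function 
names are hypothetical, and the legend does not say how a length not 
divisible by three was split; here the last third simply takes any 
remainder.

```python
def split_into_thirds(seq):
    """Split a sequence into three contiguous, non-overlapping pieces."""
    n = len(seq)
    cuts = [0, n // 3, (2 * n) // 3, n]
    return [seq[cuts[i]:cuts[i + 1]] for i in range(3)]


def mean_detection(percentages):
    """Average the detection percentages obtained for the three thirds."""
    return sum(percentages) / len(percentages)


# Toy sequence, only to show where the cuts fall.
print(split_into_thirds("ABCDEFGHIJ"))  # prints ['ABC', 'DEF', 'GHIJ']
```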


ACKNOWLEDGEMENTS 
----------------
We thank James Crook (for making PROWL available) and 
Academic Computing Service staff at Birmingham.  This work 
was supported by the Science and Engineering Research 
Council (CCP11) and Medical Research Council (Grant 
G.9025236CB to N.L.B.).


References
----------
1.  Devereux, J., Haeberli, P., and  Smithies, O. (1984) A 
comprehensive set of programs for the VAX.  Nucl. Acids. 
Res. 12, 387-395.
2.  Pearson, W.R., and Lipman, D.J. (1988) Improved tools 
for biological sequence comparison.  Proc. Natl. Acad. 
Sci. USA 85, 2444-2448.
3.  Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and 
Lipman, D.J. (1990) Basic alignment search tool.  J. 
Mol. Biol. 215,  403-410.
4.  Altschul, S.F., and Lipman, D.J. (1990) Protein 
database searches for multiple alignments.  Proc. Natl. 
Acad. Sci. USA 87, 5509-5513.
5.  Akrigg, D., Bleasby, A.J., Dix, N.I.M., Findlay, 
J.B.C., North, A.C.T., Parry-Smith, D., Wooton, J.C., 
Blundell, T.L., Gardner, S.P., Hayes, F., Islam, S., 
Sternberg, M.J.E., Thornton, J.M., Tickle, I.J., and 
Murray-Rust, P. (1988) A protein sequence/structure 
database.  Nature  335, 745-746.
6.  George, D.G., Barker, W.C., and Hunt, L.T. (1986) The 
protein identification resource (PIR).  Nucl. Acids Res. 
14, 11-15.
7.  Coulson, A.F.W., Collins, J.F., and Lyall, A. (1987) 
Protein and nucleic-acid sequence database searching - a 
suitable case for parallel processing.  Computer J. 30, 
420-424.


Appendix: BINARY and Bioline information
-----------------------------------------
Binary is an international journal which publishes a broad range 
of articles related to all  aspects of computing as applied to 
microbiology. 

SUBSCRIPTION INFORMATION: 6 Issues per annum.  Submissions and 
subscription information from the editorial office at the
School of Pure & Applied Biology
University of Wales
College of Cardiff
PO Box 915
Cardiff CF1 3TL, UK
Tel: 0222 874000 x 5743/4974;
fax: 0222 874305;
email: sabjbe at uk.ac.cardiff.thor 

BINARY- Computing in microbiology, whose  contents list appears 
regularly in the BIO-JRNL newsgroup, is now available in an 
electronic format, downloadable from the Base de Dados 
Tropical (BDT), Brazil.

Abstracts and summaries of papers in BINARY are all available free 
of charge.  The system is easy to use since it is available 
through the increasingly familiar gopher system on the Internet. 
Instructions for use are provided under the option "Instructions for 
using Bioline Publications" on the main menu.

For more information, please email to 
BIO at BIOSTRAT.DEMON.CO.UK
or mail/fax to:

     Bioline Publications
     Stainfield House
     Stainfield
     Bourne
     Lincs PE10 0RS, UK

     Fax:   +44 778 570175
     Tel:  +44 778 570618



