ABI basecalling performance
aberno at genome.stanford.edu
Tue Feb 7 18:04:57 EST 1995
I did a quick experiment this afternoon to look at the relative
performance of basecalling using ABI's older "Analysis" and newer
"Sequencing Analysis" applications, using both the Standard and Adaptive
basecallers in the latter case. My method was to take 20 trace files,
culled from our production sequencing, which had been determined to be
data from our cosmid cloning vector. I compared the sequences called with
each basecaller to the known sequence for the cosmid vector using my own
Smith-Waterman alignment code on bases 30 to 330 in each file. (I started
at base 30 in order to avoid the M13 cloning vector, which of course would
not align to the cosmid vector sequence.) The results are interesting.
Here is a table summarizing the resulting average number of errors in this
300 base region:
              Old     Standard   Adaptive
Overcalls     0.55    0.55       2.9
Undercalls    1.5     2          4
Mismatches    1.7     1.7        5.5
N's           2.8     3.3        7.1
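For a rough per-base view of these averages, the counts can be divided by the 300-base window. The sketch below does only that arithmetic on the numbers in the table; the names (WINDOW, avg_errors, percent_error) are made up for illustration, and whether N's should count as errors is left as an option rather than decided here.

```python
# Average errors per file in the 300-base window, copied from the table.
WINDOW = 300

avg_errors = {
    "Old":      {"overcalls": 0.55, "undercalls": 1.5, "mismatches": 1.7, "ns": 2.8},
    "Standard": {"overcalls": 0.55, "undercalls": 2.0, "mismatches": 1.7, "ns": 3.3},
    "Adaptive": {"overcalls": 2.9,  "undercalls": 4.0, "mismatches": 5.5, "ns": 7.1},
}


def percent_error(counts, include_ns=False):
    """Return errors per 100 bases; N's are only added if include_ns is True."""
    total = sum(v for k, v in counts.items() if include_ns or k != "ns")
    return 100.0 * total / WINDOW


if __name__ == "__main__":
    for name, counts in avg_errors.items():
        print(f"{name:9s} {percent_error(counts):.2f}%  "
              f"(with N's: {percent_error(counts, include_ns=True):.2f}%)")
```

On these numbers the old and standard basecallers come out a little over 1% without N's, and the adaptive basecaller a little over 4%.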
Two files were not included, as one failed in the old and standard
basecallers, and another failed in the adaptive basecaller - by "failed",
I mean that there were so many errors that the alignment algorithm could
not figure out what part of the cosmid vector the sequence came from.
I don't think that the differences between the "old" and standard
basecallers are significant; indeed, it seems to be the identical
algorithm with only a few minor tweaks. However, it is clear that the
adaptive basecaller is much less accurate on data where the standard
basecaller can function.
A few points, and a disclaimer:
- N's were counted as matching any base.
- I checked to make sure that I was comparing roughly the same sequence in
each case. The base numbering was slightly different (+/- 10 bases) from
each basecaller, since the basecalling started at different locations.
This effect might have skewed the results slightly, but I see no evidence
that it did.
- The adaptive basecaller is probably quite useful on data where the
standard basecaller fails, but my dataset does not contain any data of
- I'm fairly confident that my Smith-Waterman code is correct, but I
haven't rigorously demonstrated this.
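For readers who want to try something similar, here is a minimal Smith-Waterman sketch that counts errors in the way described above. It is not the code used for this test: the function name, the scoring values (match=1, mismatch=-1, gap=-2), the linear gap penalty and the tie-breaking order in the traceback are all assumptions chosen for illustration. An N in the called sequence is scored as a match to any base and tallied separately.

```python
def sw_error_counts(called, ref, match=1, mismatch=-1, gap=-2):
    """Locally align `called` to `ref`; return (score, error counts)."""

    def score(a, b):
        # An N in the called sequence is treated as matching any base.
        return match if a == b or a == "N" else mismatch

    rows, cols = len(called) + 1, len(ref) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + score(called[i - 1], ref[j - 1]),
                H[i - 1][j] + gap,   # extra base in the called sequence
                H[i][j - 1] + gap,   # base missing from the called sequence
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the local alignment ends.
    counts = {"overcalls": 0, "undercalls": 0, "mismatches": 0, "ns": 0}
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        a, b = called[i - 1], ref[j - 1]
        if H[i][j] == H[i - 1][j - 1] + score(a, b):
            if a == "N":
                counts["ns"] += 1
            elif a != b:
                counts["mismatches"] += 1
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            counts["overcalls"] += 1
            i -= 1
        else:
            counts["undercalls"] += 1
            j -= 1
    return best, counts


if __name__ == "__main__":
    vector = "ACGTTGCAAC"   # toy stand-in for the known vector sequence
    print(sw_error_counts("ACGTTAGCAAC", vector))
```

A very low best score relative to the read length would be one way to flag a "failed" file, though any such cutoff would have to be chosen empirically.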
Disclaimer: Determining error rates in basecalling is something of a black
art, and it's surprisingly hard to get numbers that are believable. In
general, I don't believe anything calculated on fewer than 100 files, so
take this data with a grain of salt. Factors like incorrect operation of
the software, chimaeric clones, programming errors, and a host of other
things can mess up the numbers. Although I tried to eliminate these
factors to the extent I could, this was just a quickie test to satisfy my
curiosity, and as such I would not take these results as the last word on
Anthony Berno -- Stanford DNA Sequencing and Technology Center
Business: aberno at genome.stanford.edu Personal: aberno at netcom.com
Work: (415) 812-1972 Fax: (415) 812-1975 Home: (510) 487-6725
WWW Home Page: http://genome-www.stanford.edu/~aberno/home.html