Dear mol-evol friends,
A persistent problem for scientists interested in (automated) sequencing is
comparing the available sequencing systems and their base calling
algorithms on sequencing accuracy. Each company promotes its own system,
and all claim a very high accuracy. The scoring system for accuracy is
always ill-defined, and is invariably very forgiving of the shortcomings of
the vendor's own sequencer. As a way to compare different systems, and to
measure the accuracy of the base calling algorithms, we would like to propose
a uniform scoring system for the assessment of DNA sequencing accuracy. I
wrote a small draft about this, and would like to have this read by
the molecular-evolution mailing list before submitting this somewhere.
I would welcome your advice and comments on this manuscript, as well as
suggestions on where to publish this eventually. If you are interested in
seeing the figure, I will fax it to you if you E-mail me a fax-number.
Thank you for your help !!
UNIFORM SCORING SYSTEM FOR THE ASSESSMENT OF DNA SEQUENCING ACCURACY
DNA sequencing has become one of the most widely used techniques in
molecular biology. This is reflected in the myriad of sequencing
protocols and strategies. All sequencing methods aim for the expeditious
generation of large amounts of sequence information. Optimization and
automation of DNA sequencing technology are a priority when contemplating
megabase sequencing efforts such as the Human Genome Project. Automated
DNA sequence analysis, achieved by fluorescence-based labelling and
detection methods, emerges as an alternative to classical autoradiography.
A number of non-isotopic automated sequence analysis systems, such as the
ABI370A (Applied Biosystems, Inc.), A.L.F. Automated Laser Fluorescence
(Pharmacia LKB Technologies), BaseStation (Millipore Corporation), and
GENESIS 2000 (Dupont de Nemours, Inc.), are or were commercially available.
Overall productivity in these systems is determined by the volume and
accuracy of the generated sequences. Volume seems relatively easy to
compare across systems, but the usable volume depends on accuracy: a
sequence should be considered terminated once the declining accuracy
no longer warrants analysis of the sequence further downstream.
Comparing the accuracy of the resolution enhancement and automated base
calling algorithms of different sequencing systems proves difficult due to
differences in the scoring methods.
We propose a more uniform and balanced scoring system for the assessment of
sequencing accuracy. Not all errors are equally significant in compromising
sequencing accuracy. When a base calling routine assigns the IUPAC-IUB
ambiguity code 'S' (C or G) to a base which is in reality a G, this is a
relatively minor flaw which should not be penalized with the same gravity as
an inaccurate base call (e.g., G reported as A). Similarly, assigning an 'N'
(100% uncertainty whether A, C, G or T) to a base is a less serious problem
than failing to detect the base at all, or inserting an extra base. Table 1
summarizes the criteria used for the calculation of an accuracy score where,
starting from a score of 100%, arbitrary but balanced point values are
deducted for each sequencing error.
----------------------------------------------------------------------------
TABLE 1: Criteria used for calculation of accuracy score
----------------------------------------------------------------------------
PENALTY   INACCURACY
1         Inaccurate base call (e.g., G reported as A or H (A, C, or T))
1         Deletion (failure to detect a base)
1         Insertion of an extra base
0.5       100% uncertainty (e.g., G reported as N (A, C, G, or T))
0.375     75% uncertainty (e.g., G reported as V (A, C, or G))
0.25      50% uncertainty (e.g., G reported as S (C or G))
----------------------------------------------------------------------------
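As an illustration only, the criteria in Table 1 could be applied in code
roughly as follows. This is a minimal Python sketch, not part of the draft's
method: it assumes the known and the reported sequence have already been
aligned to equal length, with '-' marking a gap (a convention chosen here),
and the function names base_penalty and total_penalty are hypothetical.

```python
# Standard IUPAC-IUB nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

# Penalty when the reported code includes the true base, keyed by how many
# bases the code covers (values taken from Table 1; 1 base = correct call).
AMBIGUITY_PENALTY = {1: 0.0, 2: 0.25, 3: 0.375, 4: 0.5}


def base_penalty(true_base, called):
    """Penalty for one aligned position ('-' is the assumed gap symbol)."""
    if true_base == "-" or called == "-":
        return 1.0  # inserted extra base, or base not detected
    candidates = IUPAC[called]
    if true_base not in candidates:
        return 1.0  # inaccurate call, e.g. G reported as A or H
    return AMBIGUITY_PENALTY[len(candidates)]


def total_penalty(true_aligned, called_aligned):
    """Sum of penalty points over two pre-aligned, equal-length sequences."""
    if len(true_aligned) != len(called_aligned):
        raise ValueError("sequences must be aligned to the same length")
    return sum(base_penalty(t, c) for t, c in zip(true_aligned, called_aligned))
```

For example, total_penalty("ACGT-A", "ACSTGN") adds 0.25 for the 'S', 1 for
the inserted G, and 0.5 for the 'N'.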
Accuracy is not a static parameter; it invariably declines as bases are
read further away from the priming site. The slope of the accuracy curve
gives a better appreciation of the sequencing performance of the system
than any single value.
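One way such an accuracy curve might be tabulated is sketched below in
Python. The draft does not specify how penalty points convert to a
percentage, so this sketch assumes that each window starts at 100% and
loses 100/window-length percent per penalty point; that conversion, the
25 bp default window, and the name windowed_accuracy are all assumptions
made for illustration.

```python
def windowed_accuracy(penalties, window=25):
    """Percent accuracy for consecutive, non-overlapping windows.

    `penalties` holds one penalty value per read position, ordered from the
    priming site outward. Assumed conversion: 100 * (n - points) / n for a
    window of n positions.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    curve = []
    for start in range(0, len(penalties), window):
        chunk = penalties[start:start + window]
        n = len(chunk)  # the last window may be shorter than `window`
        curve.append(100.0 * (n - sum(chunk)) / n)
    return curve
```

Plotting the returned values against window position would give a curve
whose downward slope reflects how quickly accuracy decays along the read.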
There are advantages in processing DNA sequences that are not 100% accurate
in their 3' parts. They can be useful in finding overlaps with other sub-
clones, or when interpreting the results of confirmatory or opposite strand
sequencing runs. However, this advantage steeply declines when more than 5
penalty points are accumulated over a 25 bp window, and a sequence should
be considered terminated at this point.
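The termination rule above could be sketched in Python as a sliding-window
check. The draft does not say exactly which position counts as the
termination point, so this sketch assumes it is the first position at which
the trailing 25 bp window holds more than 5 penalty points; the function
name termination_index is hypothetical.

```python
def termination_index(penalties, window=25, threshold=5.0):
    """Return the 0-based position where the read is considered terminated.

    `penalties` holds one penalty value per read position. The read is
    terminated at the first position whose trailing `window` positions sum
    to more than `threshold`; None means the rule never triggers.
    """
    running = 0.0
    for i, p in enumerate(penalties):
        running += p
        if i >= window:
            running -= penalties[i - window]  # drop the position leaving the window
        if running > threshold:
            return i
    return None
```

With 100 clean positions followed by a run of miscalled bases, the rule
triggers at the sixth miscalled base, since exactly 5 points in a window
do not yet exceed the threshold.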
As an example, we assessed the accuracy of three automated sequencing
systems on their ability to sequence ten supercoiled dsDNA templates whose
sequences were known. The graphical representation of the evolution of the
mean accuracy scores of the three systems is shown in Figure 1.
(end)
Thank you for reading and commenting on this draft version !
Marc Van Ranst, PhD
Albert Einstein College of Medicine
Ullmann Building, Room 515
1300 Morris Park Avenue
Bronx, New York, NY 10461
Tel.: 212-430 3744
Fax : 212-918 0857
E-mail : vanranst at aecom.yu.edu