sequence quality
  base-by-base error probability for base calling programs
  reflects assay bias (e.g. detection chemistry, algorithms)
  allows for more efficient sequence editing and assembly
  allows for poorly supervised automation
base calling: PHRED...
  Ewing et al. (1998), Ewing and Green (1998)
  the standard open base caller for ABI BigDye chemistry
  works with other chemistries also
  more ABI training data => best for ABI
  ABI's KB base caller is good (better), but closed source
  other base callers for other chemistries (e.g. LifeTrace)
  most algorithmic differences among programs are minor
  differences are mostly a result of different training data
  algorithms are empirically derived (i.e. kluges)
...base calling: PHRED...
  [1] calculate ideal peak locations
      assumes relatively even spacing
      chromatograms converted from log to linear
  [2] locate peaks in trace data
  [3] compare ideal and actual peaks (align)
      merge and split peaks based on ideal peaks
      call bases using signal intensity
  [4] call ambiguous bases
      near equal signal for multiple bases
PHRED: ideal peak locations...
  [1a] preliminary peaks for each dye color
       the maximum value between a pair of inflection points
       midpoint is used if there is no maximum
       must be 10% above previous peak (background)
  [1b] synthetic trace of preliminary peaks (all dye colors)
       height = 1, width = 1/4 local peak-to-peak distance
  [1c] sliding window: each peak ± 200 scans
       calculate mean scaled standard deviation of peak-to-peak distance
       <0.45 == good spacing
...PHRED: ideal peak locations
  [1d] select starting point
       window of lowest mean scaled standard deviation
       work right to end, then left to start
    [i] construct a damped synthetic trace at the current position
        Fourier transform the synthetic trace, i.e. fit to a sine wave function
    [ii] if mean scaled standard deviation >0.45 => force average spacing;
         else: modify fit based on direction (left or right) and other kluges
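The spacing test in [1c] and the choice of starting window in [1d] can be sketched as below. This is a minimal illustration, not PHRED's code: it assumes that "mean scaled standard deviation" means the standard deviation of the peak-to-peak distances divided by their mean, and the function names (`spacing_score`, `has_good_spacing`, `best_start`) are hypothetical.

```python
import statistics

def spacing_score(peaks, center, half_window=200):
    """Std. dev. of peak-to-peak distances divided by their mean (assumed
    reading of "mean scaled standard deviation"), using only the peaks
    within +/- half_window scans of `center`."""
    local = sorted(p for p in peaks if abs(p - center) <= half_window)
    gaps = [b - a for a, b in zip(local, local[1:])]
    if len(gaps) < 2:
        return None  # too few peaks in the window to judge spacing
    return statistics.pstdev(gaps) / statistics.mean(gaps)

def has_good_spacing(peaks, center, threshold=0.45):
    """Apply the slide's rule: a score below 0.45 counts as good spacing."""
    score = spacing_score(peaks, center)
    return score is not None and score < threshold

def best_start(peaks, half_window=200):
    """Return the peak whose window has the lowest score (step [1d])."""
    scored = [(spacing_score(peaks, p, half_window), p) for p in peaks]
    scored = [(s, p) for s, p in scored if s is not None]
    return min(scored)[1] if scored else None
```

Evenly spaced peaks give a score of 0, while alternating short and long gaps push the score well above the 0.45 cutoff.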
PHRED: locate peaks
  [2a] for each dye color search original trace for concave regions
       sum fluorescence signal for each scan to estimate peak area (area under the curve)
  [2b] accept peaks that are at least 10% bigger than the average of the 10 previous peaks and 5% larger than the previous peak
       peak location == geometric center
PHRED: ideal vs. actual peaks
  [3a] align ideal and actual peaks
       similar to a sequence alignment algorithm
  [3b] call exact matches (highest intensity signal)
  [3c] call large shifted peaks (>0.2 relative average area)
  [3d] call small shifted peaks (>0.1 relative average area)
  [3e] remaining uncalled peaks are either called, or saved as best uncalled peak
       if no signal predominates, called as N
PHRED: call ambiguous bases
  [4] any peak not assignable to a predicted peak is called provided it:
    [4a] is the strongest signal at a given scan
    [4b] is >10% above background
    [4c] is unsplit (i.e. is just one peak)
    [4d] is flanked by called peaks
    [4e] improves local peak spacing when added
error probabilities: PHRED
  calculates probabilities using a local window
    able to distinguish between good and bad regions
    not able to distinguish overall good from bad
  outputs log probabilities, e.g. q = -10 log10(p) [p = 0.001; q = 30]
  predicts quality by measuring peak properties
    similar to linear discriminant analysis without assumption of normality (data are not normal)
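The relation q = -10 log10(p) and its inverse can be written as a short sketch. The function names are illustrative only and are not part of PHRED.

```python
import math

def error_prob_to_quality(p):
    """Convert an error probability p (0 < p <= 1) to a quality value q."""
    return -10.0 * math.log10(p)

def quality_to_error_prob(q):
    """Convert a quality value q back to an error probability p."""
    return 10.0 ** (-q / 10.0)
```

With p = 0.001 this gives q = 30 (up to floating-point rounding), matching the example on the slide. Each step of 10 in q is a tenfold drop in error probability.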
sequence error probabilities
PHRED: signs of error
  (a) peak spacing (7 peak window)
  (b) height of largest uncalled peak relative to smallest called peak (7 peak window)
  (c) height of largest uncalled peak relative to smallest called peak (3 peak window)
  (d) distance from the nearest unresolved base ( -1)
PHRED: threshold values
  need training set (i.e. resequence known regions)
    usually calculated from plasmid sequences
    not directly comparable to PCR products
  produce a lookup table for q = 1 to 50
    compute empirical error rate for each parameter
    new sequence versus known sequence
  can be generated for any sequencing technology
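The way a lookup table follows from a training set can be sketched for a single parameter. This is a simplification under stated assumptions, not PHRED's procedure: the fixed-width binning, the add-one smoothing, and the name `build_lookup` are all hypothetical, and PHRED itself combines several parameters rather than one.

```python
import math
from collections import defaultdict

def build_lookup(observations, bin_width=0.1, q_min=1, q_max=50):
    """Map parameter bins to quality values from training data.

    observations: iterable of (parameter_value, is_error) pairs, where
    is_error is True when the called base disagrees with the known sequence.
    """
    counts = defaultdict(lambda: [0, 0])  # bin index -> [errors, total]
    for value, is_error in observations:
        b = int(value // bin_width)
        counts[b][0] += int(is_error)
        counts[b][1] += 1
    table = {}
    for b, (errors, total) in counts.items():
        # Add-one smoothing (an assumption) keeps q finite when a bin
        # contains no observed errors.
        rate = (errors + 1) / (total + 2)
        q = round(-10 * math.log10(rate))
        table[b] = max(q_min, min(q_max, q))  # clamp to the q = 1 to 50 range
    return table
```

A bin with 1 error in 999 calls maps to about q = 27, while a bin where half the calls are wrong maps to q = 3.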
contig quality
  measure the product of multiple sequencing reads
  determine if contig requires additional reads
  sum of positions above an arbitrary threshold
    normalized by the number of times each position should have been read (given the sequencing technology)
  used for DNA barcoding (Little 2010)
$S_R = \sum_{i=1}^{j} \begin{cases} 0, & R_i < q \\ 1, & R_i \geq q \end{cases}$
[Figure: plot against sequence quality (x-axis, 0 to 100; y-axis, 0.0 to 1.0), labeled S_R and P.]
$S_R = \sum_{i=1}^{j} \begin{cases} 0, & R_i < q \\ 1, & R_i \geq q \end{cases}$
$B_q = \frac{\sum_{R=1}^{k} S_R}{c x}$
[Figure: B_30 (y-axis, 0.0 to 1.0) versus percent high quality sequence (x-axis, 0 to 100); orange = 1× coverage; blue = 2× coverage.]
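The contig quality index can be sketched as follows: count the positions in each read whose quality is at least q, sum over reads, and divide by expected coverage times expected length. The function and parameter names are illustrative, and reading c as expected coverage and x as expected sequence length is an assumption based on the "number of times each position should have been read" wording.

```python
def high_quality_count(read_quals, q=30):
    """S_R: number of positions in one read with quality >= q."""
    return sum(1 for r in read_quals if r >= q)

def contig_quality(reads, expected_length, expected_coverage, q=30):
    """B_q: summed S_R over all reads, divided by (coverage * length)."""
    total = sum(high_quality_count(read, q) for read in reads)
    return total / (expected_coverage * expected_length)

# Two reads over a 4-base target sequenced at 2x coverage.
reads = [[40, 40, 10, 35], [30, 29, 30, 5]]
b30 = contig_quality(reads, expected_length=4, expected_coverage=2)
```

Here five of the eight expected base reads reach q = 30, so `b30` is 0.625.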
Illumina base calling
  model based: AYB (Massingham and Goldman 2012), Bustard (Illumina default), BayesCall (Kao and Song 2009), naivebayescall (Kao and Song 2011), Onlinecall (Das and Vikalo 2012), Rolexa (Ledergerber and Dessimoz 2011), Softy (Das and Vikalo 2013), Swift (Whiteford et al. 2009), etc.
  (supervised) machine learning: Altacyclic (Erlich et al. 2008), freeibis (Renaud et al. 2013), Ibis (Kircher et al. 2009), etc.
Illumina base calling
  important parameters:
    cross talk among dyes
    phasing (i.e. secondary signals) as a function of cycle
    signal decay as a function of cycle
    intensity of the previous cycle
    intensity of the current cycle
    intensity of the next cycle
Illumina base calling
Table 1. Accuracy for each basecaller on an Illumina GAIIx dataset (2 × 126 cycles with 366 135 257 clusters) (Renaud et al. 2013)
  Basecaller      Training time  Calling time  Mapped (%) a           Edit distance
  Bustard         -              -             583 348 201 (83.93%)   1.379
  naivebayescall  591 h          658 h         578 957 145 (83.34%)   1.496
  AYB             -              394 h         593 183 967 (85.52%)   1.076
  Ibis            19.4 h         13.2 h        592 929 953 (85.31%)   1.167
  freeibis        21.3 h         12.2 h        594 095 219 (85.48%)   1.145
  The human sequences were mapped to the hg19 version of the human genome. The number of mapped sequences and the average number of mismatches for those were tallied for each method. Time trials were conducted on a machine with 74 GB of RAM and using 8 of the 12 Intel Xeon cores running at 2.27 GHz.
  a Percentage relative to sequences assigned to the read group of interest.
Table 1. Comparison of error rates and speed for GAII (Das and Vikalo 2013)
  Decoding strategy  Error rate  Running time
  FB                 0.0128      400 min
  SOVA               0.0129      300 min
  OnlineCall         0.0137      30 min
  naivebayescall     0.0139      1500 min
  Ibis               0.0147      480 min
  Bustard            0.0154      40 min
  Rolexa             0.0171      720 min
  A comparison of error rates and running times (per lane) for different base callers (note that Bustard's running time is underestimated since it does not account for the parameter estimation step).
Illumina error probabilities
  model based:
    $P_A = I_A / (I_A + I_C + I_G + I_T)$ (Whiteford et al. 2009)
    likelihood of the base call (Das and Vikalo 2012)
  (supervised) machine learning:
    SVM assignment scores converted to error probabilities using piecewise linear regression (Renaud et al. 2013)
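The intensity-ratio probability can be illustrated with a small sketch. It assumes the four channel intensities are non-negative and have already been corrected for cross talk and phasing. The function names are hypothetical and do not come from Swift or any other base caller.

```python
def base_probabilities(intensities):
    """Return P_b = I_b / (I_A + I_C + I_G + I_T) for each base b."""
    total = sum(intensities.values())
    if total <= 0:
        raise ValueError("intensities must sum to a positive value")
    return {base: value / total for base, value in intensities.items()}

def call_base(intensities):
    """Call the base with the highest probability and report 1 - P as
    its error probability."""
    probs = base_probabilities(intensities)
    base = max(probs, key=probs.get)
    return base, 1.0 - probs[base]

# Hypothetical corrected intensities for one cluster at one cycle.
base, p_error = call_base({"A": 900.0, "C": 50.0, "G": 30.0, "T": 20.0})
```

In this example A carries 90% of the total signal, so A is called with an error probability of about 0.1.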