MARKER beta
Haplotypic variation in a given population can, in theory, be captured by genotyping a subset of informative SNPs, termed haplotype tagging SNPs (htSNPs), which in combination describe all common haplotypes. Many haplotype tagging algorithms exist, but they generally follow one of two approaches. The first is block based: contiguous SNPs are grouped into arbitrary blocks, with high correlation between SNPs within a block and lower correlation between SNPs in different blocks, and each block is then tagged independently. The second ignores any contiguous block structure and exploits SNP correlations across the whole region as a single unit.

Several investigators have shown that the second approach yields a smaller, more cost-effective htSNP set. We have shown that, although these sets may be more cost-effective, they carry little if any redundant information and are therefore more susceptible to error when used to reconstruct haplotypes from genotype data. This is particularly noticeable in regions of low linkage disequilibrium (LD) and when a significant proportion of the genotype data is missing.

How can we exploit the efficiency of an unstructured approach, which makes maximal use of SNP correlations across the whole region, while still ensuring acceptable accuracy in haplotype inference? The error rate application uses simulation to predict error in haplotype inference. It provides an iterative method by which a minimal set of htSNPs derived using an unstructured tagging algorithm (e.g. Entropy) can be sequentially modified by adding further SNPs in regions where error susceptibility is high or where LD is low, until the error rates fall to an acceptable level. This process enables region-specific and population-specific assessment and correction of error susceptibility.
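The iterative refinement described above can be sketched as a simple loop. This is only an illustration under stated assumptions: in the application the user chooses which SNPs to add, whereas the sketch automates that choice by adding the unselected SNP nearest the worst locus. The function `augment_htsnps`, its parameters, and the `locus_error` callback (standing in for a full simulation run) are hypothetical names, not part of the application.

```python
from typing import Callable, List, Sequence


def augment_htsnps(
    initial: Sequence[int],
    n_snps: int,
    locus_error: Callable[[List[int]], List[float]],
    threshold: float,
) -> List[int]:
    """Add SNPs one at a time until every per-locus error is <= threshold.

    locus_error is a hypothetical callback: given the selected htSNP
    indices, it returns one percentage error per locus (in the real tool
    this would come from the simulation, not from this function).
    """
    selected = sorted(set(initial))
    while len(selected) < n_snps:
        errors = locus_error(selected)
        # Locus with the highest predicted error in this round.
        worst = max(range(n_snps), key=lambda i: errors[i])
        if errors[worst] <= threshold:
            break  # error profile is acceptable everywhere
        # Assumed heuristic: add the unselected SNP closest to that locus.
        candidates = [i for i in range(n_snps) if i not in selected]
        nearest = min(candidates, key=lambda i: abs(i - worst))
        selected = sorted(selected + [nearest])
    return selected
```

The loop stops either when the worst locus meets the threshold or when every SNP has been selected, so it always terminates.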
In brief, the application randomly assigns haplotypes to a population of 500 individuals on the basis of population haplotype frequencies provided by the user. The user selects htSNPs, and htSNP genotype data are then generated for each individual in the population. SNPHAP is used to infer haplotypes from the htSNP genotype data, and the inferred haplotypes are compared with the haplotypes originally assigned to each individual. An overall error rate can therefore be calculated for each locus on the haplotype. Simulations are repeated five times without missing data and five times with up to 20% missing data at each htSNP locus. The resulting errors (mean +/- SEM for each locus) are displayed as a haplotype error profile, in blue for simulations without missing data and in red for simulations with missing data. Selected htSNPs are highlighted in yellow. Pairwise LD statistics (r²) are aligned next to the simulated error profiles.

Analyse your own haplotype data

Because the calculations can be quite slow (of the order of minutes at least), you must register as a user and log in to vary the set of selected SNPs and recompute the error rates.
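The data-generation step of the simulation described above (assigning haplotypes and producing htSNP genotypes) might look roughly like the sketch below. It rests on several assumptions not stated on this page: haplotypes are strings of '0'/'1' alleles, each individual receives two haplotypes drawn independently according to the supplied frequencies, and "up to 20% missing data" is read as a per-locus missing rate drawn uniformly between 0 and the maximum. The function name `simulate_genotypes` and its parameters are hypothetical. The inference step itself (running SNPHAP) is external and is not shown.

```python
import random


def simulate_genotypes(haplotypes, freqs, htsnps,
                       n_individuals=500, max_missing=0.0, seed=None):
    """Assign haplotype pairs and derive htSNP genotypes for a population.

    haplotypes: list of equal-length strings of '0'/'1' alleles (assumed coding)
    freqs:      population frequency of each haplotype
    htsnps:     indices of the selected tagging SNPs
    Returns (population, genotypes); a genotype is the count of '1'
    alleles (0, 1 or 2) at each htSNP, or None where data are missing.
    """
    rng = random.Random(seed)
    # Assumption: the two haplotypes of an individual are drawn independently.
    population = [tuple(rng.choices(haplotypes, weights=freqs, k=2))
                  for _ in range(n_individuals)]
    # Assumption: each locus gets its own missing rate in [0, max_missing].
    missing_rate = {s: rng.uniform(0.0, max_missing) for s in htsnps}
    genotypes = []
    for h1, h2 in population:
        row = {}
        for s in htsnps:
            if rng.random() < missing_rate[s]:
                row[s] = None  # genotype call is missing
            else:
                row[s] = int(h1[s]) + int(h2[s])
        genotypes.append(row)
    return population, genotypes
```

Keeping the assigned haplotype pairs alongside the genotypes is what allows the inferred haplotypes to be scored against the truth afterwards.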
| Without missing data | With missing data |
| 15.55 +/- 0.73 % novel haplotypes | 25.82 +/- 4.03 % novel haplotypes |
The percentage error rate (mean +/- standard error) is shown below for runs without missing data (blue) and with missing data (red). The 25 selected SNPs are indicated on the graph with yellow highlighting.
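As an illustration of how a mean +/- standard error profile can be computed from repeated runs, here is a minimal sketch. It assumes the standard error is the sample standard deviation (n - 1 denominator) divided by the square root of the number of runs; the page does not state which estimator the application uses. The function name `error_profile` is hypothetical.

```python
import math
from statistics import mean, stdev


def error_profile(runs):
    """Return a (mean, SEM) pair per locus.

    runs: list of simulation runs, each a list of per-locus percentage
    errors (all runs must cover the same loci in the same order).
    """
    n = len(runs)
    profile = []
    for locus_errors in zip(*runs):  # regroup values locus by locus
        m = mean(locus_errors)
        # SEM is undefined for a single run; report 0.0 in that case.
        sem = stdev(locus_errors) / math.sqrt(n) if n > 1 else 0.0
        profile.append((m, sem))
    return profile
```

With five runs per condition, as described above, each locus would get one pair for the runs without missing data and one for the runs with missing data.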
The mean percentage error rates are listed below.