Processing of HapMap genotypes
The MARKER team download genotypes from the HapMap project and, after some processing, load them into the MARKER database. Pairwise metrics are computed as described elsewhere, and the marker maps are annotated with gene definitions from Ensembl and EntrezGene.
At present we reject markers that are not cleanly bi-allelic, for example those found to be monomorphic or tri-allelic. We also discard markers with more than 10% missing data. The final test is whether a marker appears to be in Hardy-Weinberg equilibrium. We do not apply this test to rare markers, those with minor allele frequency q < 0.02, because it was found to be numerically unstable for typical sample sizes.
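The filters above can be sketched in a few lines of Python. This is an illustration only, not the MARKER team's actual code. The function names, the genotype encoding (two-character strings such as 'AG', with None for a missing call), the choice of a 1-d.f. chi-square test for Hardy-Weinberg equilibrium and the p-value cutoff are all assumptions; only the 10% missing-data limit and the q < 0.02 exemption come from the text.

```python
import math
from collections import Counter

MAX_MISSING = 0.10       # from the text: discard markers with more than 10% missing data
MIN_MAF_FOR_HWE = 0.02   # from the text: skip the HWE test when q < 0.02
HWE_P_THRESHOLD = 0.001  # assumption: the text does not state a cutoff


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """P-value of a 1-d.f. chi-square test against Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


def qc_marker(genotypes):
    """Return (accepted, reason) for one marker's list of genotype calls."""
    called = [g for g in genotypes if g is not None]
    allele_counts = Counter(allele for g in called for allele in g)
    if len(allele_counts) != 2:
        return False, "not bi-allelic"
    if len(genotypes) - len(called) > MAX_MISSING * len(genotypes):
        return False, "missing data"
    (major, _), (minor, minor_count) = allele_counts.most_common(2)
    q = minor_count / (2 * len(called))
    if q < MIN_MAF_FOR_HWE:
        return True, "accepted (rare marker, HWE test skipped)"
    n_aa = sum(g == major * 2 for g in called)
    n_bb = sum(g == minor * 2 for g in called)
    n_ab = len(called) - n_aa - n_bb
    if hwe_chi2_pvalue(n_aa, n_ab, n_bb) < HWE_P_THRESHOLD:
        return False, "HWE"
    return True, "accepted"
```

The checks run in the order given in the text, so a marker is only tested for Hardy-Weinberg equilibrium once it has passed the allele-count and missing-data filters.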
Data for the accepted markers are gathered into datasets of typically 200 markers. We move a sliding window across each chromosome's data in steps of 100 markers, so the first dataset holds markers 1-200, the second 101-300, the third 201-400, and so on. This means that most markers appear twice: a marker near the edge of one dataset will be in the middle of the next.
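As a minimal sketch of this windowing, assuming a window of 200 markers moved in steps of 100, the datasets for one chromosome could be built as follows. The function name and the treatment of the chromosome end (a shorter final dataset) are assumptions made for illustration; the text does not say how the last window is handled.

```python
def sliding_datasets(markers, size=200, step=100):
    """Split an ordered list of markers into overlapping datasets."""
    if not markers:
        return []
    if len(markers) <= size:
        return [list(markers)]
    datasets = []
    # Stop once the previous window has already reached the end of the list.
    for start in range(0, len(markers) - step, step):
        # Assumption: the final dataset may be shorter than `size`.
        datasets.append(list(markers[start:start + size]))
    return datasets


# Example with 500 accepted markers numbered 1..500.
datasets = sliding_datasets(list(range(1, 501)))
print([(d[0], d[-1]) for d in datasets])
# [(1, 200), (101, 300), (201, 400), (301, 500)]
```

In this example markers 101-400 each fall in two datasets, while the first and last 100 markers of the chromosome appear only once.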