Harvard HGDP-CEPH Genotypes for Population Genetics Analyses
Supplement 10

This readme file briefly describes the contents of the annotation.txt file as well as the 14 genotype datasets.

PLEASE ALSO SEE THE WARNINGS BELOW, BECAUSE THE DATASETS NEED TO BE USED PROPERLY TO TAKE ADVANTAGE OF THE UNIFORM ASCERTAINMENT.

(1) annotation.txt:

    This file annotates every SNP for which genotypes of 934 HGDP individuals are reported. The meaning of each column is given in the comment lines at the top of the file. A value of '---' in any field of a row means that the information is not available for the SNP in that row.

(2) 14 genotype datasets:

    Each of the 14 datasets has 3 files: sample_*.txt, *.ped, and *.map.

    sample_*.txt provides sample information. It has three columns: HGDP sample ID, gender, and population. For each dataset, the sample used in SNP ascertainment is given a special label, because this sample is expected to have very different genotype patterns from others in the same population and should not be considered in population genetics analyses. Specifically, the ascertainment sample is identified as original-population_discover. For example, HGDP00542 in sample_panel3.txt is annotated as "HGDP00542 M Papuan_discover". In sample_all_snp.txt, all 12 samples used in SNP ascertainment are labeled original-population_discover.

    The .ped and .map files are in PLINK format. All genotypes in the .ped files are reported according to the plus strand of the PanTro2 assembly.

(3) The "all_snp.*" files are special, and contain a merge of SNPs from all 13 panels, compatibility SNPs, and chrY and mitochondrial SNPs. Keep the following features of the "all_snp.*" files in mind:

    (a) The "all_snp.*" files are NOT APPROPRIATE for population genetic analyses that require uniform ascertainment.
        They are intended for analyses where the user wishes to maximize the number of SNPs available and where ascertainment bias is not a concern, such as an analysis of population substructure.

    (b) Coordinates are given according to the human reference sequence hg18, not PanTro2 as for the other files.

    (c) Only SNPs with hg18 chromosome values of 1-22, X, Y, or MT are reported. Chromosomes X, Y, and MT are labeled as 23, 24, and 26 respectively in the map files.

    (d) Every valid hg18 chromosome/position combination has a single entry in the final map file.

    (e) When one SNP has multiple probesets, the genotype of a sample is set to a no call if the genotype calls conflict across the probesets.

    (f) Tri-allelic SNPs only appeared when SNPs from all 13 panels were combined. Since PLINK only allows bi-allelic SNPs, when genotype calls for the same sample are combined across multiple probesets, only homozygous genotypes for the "common" allele between the allele specifications are kept for concordant genotype calls.

    (g) In the final map file, the Affy_SNP_Id with the alphabetically lower allele specification value is used if multiple Affy_SNP_Ids map to the same hg18 chromosome/position.

(4) For the panelN.* files:

    (a) Panel numbers range from 1 to 13.

    (b) As for the all_snp.* files, PanTro2 chromosome values are coded numerically according to the PLINK convention. No attempt has been made to check whether chrX and chrY SNP positions belong to pseudoautosomal regions. Both PanTro2 chr2a and chr2b are coded as chr2, and positions in chr2b are offset by 200,000,000.
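
As a rough illustration of points (2) and (4)(b), the Python sketch below separates ascertainment samples from the rest and undoes the chr2b offset. The function names are hypothetical. It assumes that sample_*.txt is whitespace-delimited with no header row, that the offset was added to chr2b positions, that positions are 1-based, and that no chr2a position exceeds 200,000,000. None of these assumptions is stated in this readme, so check them against the actual files.

```python
CHR2B_OFFSET = 200_000_000      # offset stated in this readme
DISCOVER_SUFFIX = "_discover"   # label suffix for ascertainment samples


def split_samples(lines):
    """Split sample_*.txt rows into (analysis, ascertainment) lists."""
    analysis, ascertainment = [], []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed rows
        sample_id, gender, population = fields
        if population.endswith(DISCOVER_SUFFIX):
            # Recover the original population name for bookkeeping.
            original = population[: -len(DISCOVER_SUFFIX)]
            ascertainment.append((sample_id, gender, original))
        else:
            analysis.append((sample_id, gender, population))
    return analysis, ascertainment


def decode_chr2(position):
    """Map a panelN.map chr2 position back to (chimp chromosome, position)."""
    # Assumption: anything above the offset came from chr2b.
    if position > CHR2B_OFFSET:
        return "chr2b", position - CHR2B_OFFSET
    return "chr2a", position


rows = [
    "HGDP00542 M Papuan_discover",  # example row quoted in this readme
    "HGDP99999 F ExamplePop",       # made-up row, for illustration only
]
analysis, ascertainment = split_samples(rows)
print(analysis)
print(ascertainment)
print(decode_chr2(200_001_234))
print(decode_chr2(1_234))
```

Only the rows in the first list would be passed on to a population genetics analysis; the second list is kept so that the excluded samples remain traceable.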

Summary of the 14 datasets

Ascertainment   Sample ID            Panel     Number of SNPs reported
French          HGDP00521            1         111,970
Han             HGDP00778            2         78,253
Papuan1         HGDP00542            3         48,531
San             HGDP01029            4         163,313
Yoruba          HGDP00927            5         124,115
Mbuti           HGDP00456            6         12,162
Karitiana       HGDP00998            7         2,635
Sardinian       HGDP00665            8         12,922
Melanesian      HGDP00491            9         14,988
Cambodian       HGDP00711            10        16,987
Mongolian       HGDP01224            11        10,757
Papuan2         HGDP00551            12        12,117
Denisova-San    Denisova-HGDP01029   13        151,435
---             ---                  all_snp   627,719

=======================================================================================================================

WARNINGS (From David Reich, Nick Patterson)

We list here some obvious mistakes that can be made using these data. In preliminary work we have made most of these ourselves!

1. Ascertainment matters! For instance, look in panels 4 and 5 at the mean "derived" allele frequency in San and Yoruba. We find

       Panel   San     Yoruba
       4       .299    .262
       5       .254    .299

   We estimate standard errors here to be about .001, so these differences between the panels are due to ascertainment (San het for panel 4, Yoruba het for panel 5).

2. Our ascertainment (and genotyping) are not perfect. Many apparent hets from our low coverage sequence data are not real, and some (we think small) number of SNPs may also have been genotyped incorrectly.

3. There will be some double mutations, so that the Chimp allele is in fact the derived, not the ancestral, allele. If this is important, looking at other primates may be helpful, which is why data are included for Gorilla, Orang, Macaque and Marmoset.

4. The ancient DNA is of substantially worse quality than the main genotyped data. In the primary .ped files we give random alleles for Vindija Neandertals and the Denisova bone, but the error rate here is not accurately known. More detailed allele counts for reads covering the SNPs in the panel are available in the annotation file, and for some purposes these may be useful.

5. The sequencing work on Denisova and Vindija was not symmetric, and the Vindija calls are of much worse quality. For some important questions it is critical to consider this. For instance, we (David and Nick) presently believe that our African samples are symmetric between (Denisova, Neandertal). This would imply that the D-statistic (see Neandertal, Denisova papers)

       D(Chimp, San; Neandertal, Denisova)

   should be zero in panel 4 (San ascertainment). We actually observe D = -.05, which corresponds to a Z-score of -5.4. We suspect that this is artefactual, induced by sequencing and alignment biases in the Neandertal samples. Please consider this possibility if such biases may affect your results.
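
The size of the effects quoted in warnings 1 and 5 can be checked with a little arithmetic. The Python sketch below is illustrative only. The function names are hypothetical, and treating the two panel estimates as independent, each with a standard error of exactly .001, is an assumption rather than something stated above.

```python
import math


def implied_standard_error(estimate, z_score):
    """Standard error implied by an estimate and its Z-score (Z = estimate / SE)."""
    return estimate / z_score


def z_of_difference(a, b, se_a, se_b):
    """Z-score of (a - b), assuming the two estimates are independent."""
    return (a - b) / math.sqrt(se_a ** 2 + se_b ** 2)


# Warning 5: D = -.05 with Z = -5.4.
se_d = implied_standard_error(-0.05, -5.4)

# Warning 1: the same population compared across panels 4 and 5.
z_san = z_of_difference(0.299, 0.254, 0.001, 0.001)
z_yoruba = z_of_difference(0.262, 0.299, 0.001, 0.001)

print(f"implied SE of D: {se_d:.4f}")
print(f"San, panel 4 vs panel 5: Z = {z_san:.1f}")
print(f"Yoruba, panel 4 vs panel 5: Z = {z_yoruba:.1f}")
```

Under these assumptions the implied standard error of D is a little under .01, and the between-panel frequency differences are dozens of standard errors from zero, consistent with the statement that they reflect ascertainment rather than sampling noise.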