Simons Genome Diversity Project (SGDP)

Last update: Wed Mar 30 13:28:01 EDT 2022 (by Shop Mallick)

This points to data for 300 public SGDP samples across 142 diverse populations.

Update history:
Fri Mar 29 10:15:21 EDT 2024: minor edits, giving a link to the paper and a note about the reference genome used
Wed Mar 30 13:28:01 EDT 2022: minor change to signed-letter data access form
Tue May 4 13:02:27 EDT 2021: New imputed and phased version of the SGDP, created by Ali Akbari (see section E); details of the methodology will be provided in a manuscript in preparation
Wed Apr 15 07:04:27 EDT 2020: PLINK datasets updated to include .bed, .fam files (section I)
Wed Dec 18 01:14:09 EST 2019: metadata updated to include the added Fan et al. samples
Thu Oct 31 01:55:03 EDT 2019: variant set updated (see section I)
Fri Mar 1 08:45:22 EST 2019: direct ENA ftp pointers table added
Thu Feb 14 14:25:31 EST 2019: MT bams for public samples, along with coverage info
Wed Feb 6 13:07:07 EST 2019: updated: (a) info for pointer to complete signed letter samples vcfs, (b) clarification about construction of variant only vcfs
Tue Jan 15 03:47:46 EST 2019: bam indices for 300 samples added (see section AD; this should allow particular regions of bams from ENA to be downloaded)
Mon Dec 3 15:17:46 EST 2018: update links following system reorg
Thu Apr 26 16:57:22 EDT 2018: bug in phased data identified (see section E)
Mon Apr 9 09:03:21 EDT 2018: EGA signed release form updated
Wed Jan 24 14:56:58 EST 2018: Pointers to mappability filters (section F3)
Thu Jan 11 10:23:40 EST 2018: Fermikit contig pointer added (section D)
Mon Apr 3 17:39:13 EDT 2017: Template PDF for signed letter provided (section Z)
Sat Feb 18 00:16:43 EST 2017: 21 signed letter vcfs (variants only) made available directly by request (following signed letter agreement)
Thu Jan 12 15:21:12 EST 2017: 21 signed letter vcfs made available directly for easy access (following signed letter agreement)
Wed Dec 7 16:01:12 EST 2016: Variant only VCFs tar ball built: public samples directly available, pointers to signedLetter samples given
Oct 6 2016: bams for Y-chromosomes available
Oct 6 2016: phased data available

Latest version:

Version 2: The dataset has been reprocessed using the newer, more accurate "bwa mem" algorithm. Data are available in several forms:

(A) Samples, raw data, alignments, genotypes
* The paper describing this project is available here: sgdp_paper
* Samples are aligned to the hs37d5 reference sequence (which is hg19 with additional decoy sequences which improve alignments).
* Metadata for the samples is here: SGDP_metadata.279public.21signedLetter.44Fan.samples.txt
* Raw data for 279 genomes for which the informed consent documentation is consistent with fully public data release are available through the EBI European Nucleotide Archive under accession numbers PRJEB9586 and ERP010710.
(Direct ftp pointers for bams are here: ena.ftp.pointers.txt)

* For the remaining 21 genomes (designated by code Y in the seventh column of Supplementary Data Table 1), as well as 44 newly released genomes (as reported by Fan et al, Genome Biology 2019) data are deposited at the European Genome-phenome Archive (EGA), which is hosted by the EBI and the CRG, under accession number EGAS00001001959. See section (Z) below for instructions.

(AA1) BAMS for Y-chromosomes only

* Y-chromosome bams are separately available from here: link (This is 129Gb).

(AA2) BAMS for MT-chromosomes only

* MT-chromosome bams are separately available from here: link (This is 45Gb).
Coverage by sample is here: link (Mean coverage is 5344x).

(AB) VCFs of variants of publicly available samples

* 279 variant only vcfs are available from here: link (This is 57Gb).
Note: variant-only vcfs are built as follows. Each sample is genotyped independently using the Unified Genotyper as described in the main paper, which produces a vcf that includes non-variant sites. Sites where no alternate allele is present are then dropped to give the variant-only vcf.
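
As an illustration of the variant-only rule described in the note, here is a minimal Python sketch that keeps header lines and any record whose genotype carries at least one non-reference allele. It assumes single-sample, tab-delimited, uncompressed VCF text; the function names are hypothetical and this is not the pipeline that produced the released files. How missing genotypes were handled is not stated here, so treating them as non-variant is an assumption.

```python
def has_alt_allele(vcf_line):
    """Return True if a single-sample VCF record carries a non-reference allele."""
    fields = vcf_line.rstrip("\n").split("\t")
    alt, fmt, sample = fields[4], fields[8], fields[9]
    if alt == ".":
        # No alternate allele was emitted at this site.
        return False
    keys = fmt.split(":")
    if "GT" not in keys:
        return False
    genotype = sample.split(":")[keys.index("GT")]
    alleles = genotype.replace("|", "/").split("/")
    # Assumption: missing calls (".") are treated as non-variant.
    return any(allele not in ("0", ".") for allele in alleles)


def variant_only(lines):
    """Yield header lines plus records that carry an alternate allele."""
    for line in lines:
        if line.startswith("#") or has_alt_allele(line):
            yield line
```

Because each sample is considered independently, a site dropped from one sample's variant-only vcf may still appear in another sample's file.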

We provide 263 C-panel fully public samples (as reported in Supplementary Information section 3) and 16 B-panel fully public samples for a total of 279 vcfs in a single tarball. Vcfs are bgzipped and provided with tbi files, with md5sums for both. Variant only vcfs are extracted from EMIT_ALL_SITES vcfs (which contain non-variant sites and are considerably bigger) which are intended to be available from EVA (validation pending). Note: These have been annotated with the results of Nick Patterson's filtering engine in the FL field, which should be helpful for variant analysis.
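
Since md5sums are provided alongside the vcfs and tbi files, a download can be checked before use. The sketch below assumes the checksums are in the usual `md5sum` text layout (hex digest, whitespace, file name, one per line) and that the listed files sit in the same directory as the listing; the actual layout inside the tarball may differ, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5_listing(listing_path):
    """Return {file name: True/False} for each entry in an md5sum-style listing."""
    listing_path = Path(listing_path)
    results = {}
    for line in listing_path.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # md5sum marks binary mode with a leading '*'
        actual = md5_of_file(listing_path.parent / name)
        results[name] = actual == expected.lower()
    return results
```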

(AC) VCFs of variants of signed letter available samples

Variant only vcfs are available upon request and submission of signed letter (as described in (Z) Requestors of Signed Letter samples: instructions). Full vcfs (including non-variant sites) are also available.

(AD) indices for 300 bam samples (both publicly available and signed letter samples)

* available here: link (This is 2.5Gb).
Note that access to the bams for the signed letter samples will still require a signed letter (see section (Z)).
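
With a locally downloaded index, a region of a remote bam can be pulled without transferring the whole file. The sketch below builds a `samtools view` command that uses the `-X` option to pass the index as a separate argument. It assumes a recent samtools built with remote-file support; the URL and file names are placeholders (real locations come from the ENA pointers table), and the helper names are hypothetical. Contigs in hs37d5 are named without a "chr" prefix (for example `Y` and `MT`).

```python
import subprocess


def region_fetch_command(bam_url, local_index, region, out_path):
    """Build a samtools command that extracts one region from a remote bam.

    -b writes BAM output, -o names the output file, and -X tells samtools
    that the argument following the bam is its index file.
    """
    return [
        "samtools", "view", "-b",
        "-o", out_path,
        "-X", bam_url, local_index,
        region,
    ]


def fetch_region(bam_url, local_index, region, out_path):
    """Run the command; raises CalledProcessError if samtools fails."""
    subprocess.run(
        region_fetch_command(bam_url, local_index, region, out_path),
        check=True,
    )


# Placeholder values for illustration only.
example = region_fetch_command(
    "ftp://example.org/path/to/sample.bam",
    "sample.bam.bai",
    "Y:2650000-2700000",
    "sample.Y.region.bam",
)
```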

(B) Compact: SGDP-lite and Ctools
Compact versions of the SGDP dataset and software for accessing it are available to give speedy access to variant/non-variant data in a dataset of only ~140Gb.

An important development to facilitate fast analysis of genomic data for large numbers of samples has been to push data into a novel format ("hetfa"), where both variant and non-variant sites are represented by an IUB encoding following genotyping. This enables extremely quick access to both single-site and region data. This has been further packaged into "SGDP-lite" (written by Nick Patterson), along with the results of a novel filtering engine to allow users to balance reliability of calls with coverage.
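
To illustrate the idea of representing a diploid genotype as a single character, here is a small Python sketch using the standard IUPAC/IUB nucleotide codes. It is not the hetfa specification: the handling of missing data (shown here as `N`) and the function names are assumptions made for illustration.

```python
BASES = set("ACGT")

# Standard IUPAC/IUB two-base ambiguity codes.
IUPAC_HET = {
    frozenset("AC"): "M",
    frozenset("AG"): "R",
    frozenset("AT"): "W",
    frozenset("CG"): "S",
    frozenset("CT"): "Y",
    frozenset("GT"): "K",
}
IUPAC_TO_ALLELES = {code: tuple(sorted(pair)) for pair, code in IUPAC_HET.items()}


def encode_genotype(allele1, allele2):
    """Encode an unphased diploid genotype as one character."""
    allele1, allele2 = allele1.upper(), allele2.upper()
    if allele1 not in BASES or allele2 not in BASES:
        return "N"  # assumption: missing or non-SNP calls become N
    if allele1 == allele2:
        return allele1  # homozygous sites keep the plain base
    return IUPAC_HET[frozenset((allele1, allele2))]


def decode_genotype(code):
    """Return the two alleles for a code, or None if it is not a called site."""
    code = code.upper()
    if code in BASES:
        return (code, code)
    return IUPAC_TO_ALLELES.get(code)
```

With one character per reference position, looking up a site or slicing a region reduces to indexing into a sequence, which is what makes access fast for both variant and non-variant sites.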

(B1) SGDP-lite
These data are available from here:
Latest version:
v3: link.

Previous versions:
v2: link.

(SGDP-lite is small enough to be transferred quickly. The full dataset is >70 Tbytes, which may make it impractical for some users in terms of both transfer costs and storage.)

(B2) Ctools
To accompany this, a powerful engine for population genetics manipulation of SGDP-lite, called "Ctools" (written by Nick Patterson and further developed by Mengyao Zhao), is available: https://github.com/DReichLab/cTools

(C) STR genotypes
The short tandem repeat (STR) genotypes are available through dbVar under accession number nstd128 (http://www.ncbi.nlm.nih.gov/dbvar). If you simply wish to access variants (and non-variants) easily, SGDP-lite is recommended (see section (B)).

(D) Fermikit
Described in SI4.
.. unitig variants are available from https://github.com/lh3/sgdp-fermi
.. contigs are available from https://www.ncbi.nlm.nih.gov/assembly/GCA_000786075.2/

(E) Phased genotypes
(E1) New phased dataset (built 2021):

• Created by Ali Akbari; details of the methodology will be provided in a manuscript in preparation.
• Samples are imputed against the 1000 Genomes Project dataset (phase 3) [1000 Genomes Project Consortium, Nature 2015].
• The initial mpileup is constructed using bcftools (version 1.10.2) [Danecek et al, GigaScience 2021]; phasing is generated using Glimpse (version 1.0.0) [Rubinacci et al, Nature Genetics 2021].

The imputed file format is BCF with the following fields:

Generated by imputation tool (glimpse):
    GT: Phased and imputed genotypes
    DS: Genotype dosage
    GP: Genotype posteriors

Generated by genotype caller (mpileup):
    PL: Phred-scaled genotype likelihoods
    AD: Allelic depths (high-quality bases)
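
The fields above relate to one another in a simple way at biallelic sites. The Python sketch below shows the conventional definitions: dosage as the expected alternate-allele count under the genotype posteriors, and PL values converted back to normalized likelihoods. It assumes biallelic sites with values ordered as (ref/ref, ref/alt, alt/alt); whether the released DS values were computed exactly this way is not stated here, and the function names are hypothetical.

```python
def dosage_from_posteriors(gp):
    """Expected number of alternate alleles given (P(0/0), P(0/1), P(1/1))."""
    p_ref_ref, p_ref_alt, p_alt_alt = gp
    return p_ref_alt + 2.0 * p_alt_alt


def likelihoods_from_pl(pl):
    """Convert Phred-scaled likelihoods to likelihoods that sum to 1."""
    raw = [10 ** (-value / 10.0) for value in pl]
    total = sum(raw)
    return [value / total for value in raw]


def alt_count_from_gt(gt):
    """Count alternate alleles in a phased or unphased GT string such as '0|1'."""
    return sum(1 for allele in gt.replace("|", "/").split("/") if allele == "1")
```

A hard call such as `0|1` has an alternate-allele count of 1, while DS can take fractional values when the posteriors are spread over more than one genotype.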

(E1a) Public sample phased data are available from here: link.
(E1b) Signed letter sample phased data are available upon request and submission of signed letter (as described in section (Z)), from here: link.

(E2): Older phased dataset (built 2016, Dec):
... are available from: link (known bugs, not recommended, please use newer dataset instead)

[Thu Apr 26 16:57:22 EDT 2018] Note: because of a bug in the processing chain, heterozygous sites at positions without a chimpanzee allele are incorrectly assigned the homozygous reference state.
This artificially increases the length of some homozygous chunks and can affect some analyses; for example, it is known to inflate recent effective population sizes in MSMC estimates. We are working to rebuild these data.

(F,G,H) Filters and tools
(F1) Sample specific filters: bam2cnv:

... is available from: bam2cnv

(F2) Sample independent filters (for these samples)

... is available from: universal_mask

(F3) Generic mappability filters

For generating the map35_50% mask, please follow the procedure described here: http://lh3lh3.users.sourceforge.net/snpable.shtml

(G) vcf2hetfa:

... is available from: vcf2hetfa.pl

(H) filtstats:

... is available from (as part of the cTools package): filtstats

(I) variants

List of SNPs discovered as polymorphic in Simons Genome Diversity Project, combining the 300 individuals sequenced in Mallick et al. Nature 2016 with the 45 additionally reported sequences in Fan et al. Genome Biology 2019, and corresponding genotype/fam data (PLINK format): .bim (zip) (200Mb), .bed (2.8Gb), .fam (8.2Kb). SNPs are filtered at 0.001 minor allele frequency (and sites require a chimp allele to be present in panTro2).
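
For readers who want to inspect the PLINK files without extra tooling, the sketch below decodes the standard SNP-major .bed layout (three magic bytes, then two bits per sample per variant) and computes a minor allele frequency from the result. The sample and variant counts would come from the line counts of the .fam and .bim files. The 0.001 cut-off mirrors the filter mentioned above, but whether the original filter was inclusive of the threshold is an assumption, as are the function names.

```python
PLINK_MAGIC = b"\x6c\x1b\x01"  # header of a SNP-major PLINK .bed file

# Two-bit genotype code -> number of A1 alleles (None means missing).
CODE_TO_A1_COUNT = {0b00: 2, 0b01: None, 0b10: 1, 0b11: 0}


def decode_variant(block, n_samples):
    """Decode one variant's packed genotypes; sample 0 is in the low-order bits."""
    counts = []
    for i in range(n_samples):
        code = (block[i // 4] >> (2 * (i % 4))) & 0b11
        counts.append(CODE_TO_A1_COUNT[code])
    return counts


def read_bed(data, n_samples, n_variants):
    """Yield per-variant A1 allele counts from the raw bytes of a .bed file."""
    if data[:3] != PLINK_MAGIC:
        raise ValueError("not a SNP-major PLINK .bed file")
    bytes_per_variant = (n_samples + 3) // 4
    for v in range(n_variants):
        start = 3 + v * bytes_per_variant
        yield decode_variant(data[start:start + bytes_per_variant], n_samples)


def minor_allele_frequency(counts):
    """Minor allele frequency over called genotypes, or None if none are called."""
    called = [c for c in counts if c is not None]
    if not called:
        return None
    freq = sum(called) / (2 * len(called))
    return min(freq, 1.0 - freq)


def passes_maf(counts, threshold=0.001):
    """Assumption: a site passes if its MAF is at least the threshold."""
    maf = minor_allele_frequency(counts)
    return maf is not None and maf >= threshold
```

The .bed file here is 2.8Gb, so in practice it is better to read it variant by variant from disk (or memory-map it) rather than load it whole.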

(Z) Requestors of Signed Letter samples: instructions
BAMS: Data for these 21 genomes can be obtained by submitting to the EGA (European Genome-phenome Archive) Data Access Committee a dated, signed letter (in pdf form only) containing the following text: (a) I will not distribute the data outside my collaboration; (b) I will not post the data publicly; (c) I will make no attempt to connect the genetic data to personal identifiers for the samples; and (d) I will not use the data for any commercial purposes. A template PDF is here: link.
Please send BAM requests to: Michelle Lee (Michelle_Lee at hms.harvard.edu) or to EGA directly (accession number EGAS00001001959)

VCFs: VariantOnly: available on request, requiring a signed letter as described above. Please send VCF requests to: Michelle Lee (Michelle_Lee at hms.harvard.edu), copying Shop Mallick (shop at genetics.med.harvard.edu). VCFs for public samples are available directly (see Section AB above).

Previous versions:

Version 1 (outdated):
The first release of genotypes, built using the "bwa aln" algorithm, is available from the Simons Foundation here: http://www.simonsfoundation.org/life-sciences/simons-genome-diversity-project-dataset/ (includes some experimentally phased samples)

Dataset size is ~10Tb. Reprints and permissions information is available at www.nature.com/reprints. Correspondence and requests for materials should be addressed to S.M. (shop@genetics.med.harvard.edu) or D.R. (reich@genetics.med.harvard.edu).

Acknowledgements

This project was funded by the Simons Foundation.

References

[Fan et al. Genome Biology 2019]: African evolutionary history inferred from whole genome sequence data of 44 indigenous African populations. Fan S, Kelly DE, Beltrame MH, Hansen MEB, Mallick S, Ranciaro A, Hirbo J, Thompson S, Beggs W, Nyambo T, Omar SA, Meskel DW, Belay G, Froment A, Patterson N, Reich D, Tishkoff SA. Genome Biol. 2019.

[Danecek et al, GigaScience 2021]: Twelve years of SAMtools and BCFtools. Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, Li H. GigaScience, Volume 10, Issue 2, February 2021, giab008, https://doi.org/10.1093/gigascience/giab008.

[Rubinacci et al, Nature Genetics 2021]: Efficient phasing and imputation of low-coverage sequencing data using large reference panels. Rubinacci S, Ribeiro D, Hofmeister R, Delaneau O. Nature Genetics 53.1 (2021): 120-126.

[1000 Genomes Project Consortium, Nature 2015]: A global reference for human genetic variation. 1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. Nature. 2015 Oct 1;526(7571):68-74. doi: 10.1038/nature15393. PMID: 26432245. http://www.internationalgenome.org/data/.